ICaRus: Identical Cache Reuse for Efficient Multi Model Inference
arXiv:2603.13281v1 Announce Type: new Abstract: Multi-model inference has recently emerged as a prominent paradigm, particularly in the development of agentic AI systems. However, in such scenarios, each model must maintain its own Key-Value (KV) cache for the identical prompt, leading to substantial memory consumption. This explosive growth of KV caches forces LLM serving systems to evict previously stored caches, which in turn introduces significant recomputation overhead whenever the evicted caches are required again. Moreover, prefix caching is inherently infeasible across different models, forcing each model to recompute the KV cache for the identical prompt, which leads to significant overhead. To alleviate these issues, we propose Identical Cache Reuse (ICaRus), a novel architecture that allows multiple models to share identical KV caches across all layers. ICaRus is based on the key observation that a decoder-only Transformer can be conceptually decomposed into a logical encoder, which generates KV caches, and a logical decoder, which predicts output tokens from the KV caches. ICaRus fine-tunes only the logical decoder while freezing the logical encoder, enabling multiple models to share an identical KV cache. This eliminates cache memory explosion and unexpected evictions while also allowing cross-model reuse of KV caches for new input tokens, thereby removing redundant recomputation in multi-model inference and achieving both efficiency and scalability. Moreover, by incorporating lightweight adapters such as LoRA, ICaRus parallelizes KV cache generation and next-token prediction during decoding. ICaRus achieves accuracy comparable to task-specific fine-tuned models across a diverse set of tasks, while allowing multiple specialized models to fully share KV caches. ICaRus achieves up to 11.1x lower P95 latency and 3.8x higher throughput in a multi-agent workflow with 8 different models, compared to a conventional multi-model system.
Executive Summary
The article ICaRus: Identical Cache Reuse for Efficient Multi Model Inference proposes a novel architecture to address memory consumption issues in multi-model inference systems. ICaRus decomposes the decoder-only Transformer into a logical encoder and a logical decoder, allowing multiple models to share an identical Key-Value (KV) cache across all layers. By fine-tuning only the logical decoder while freezing the logical encoder, ICaRus eliminates cache memory explosion and unexpected evictions. This approach achieves accuracy comparable to task-specific fine-tuned models, while allowing multiple specialized models to fully share KV caches. ICaRus demonstrates up to 11.1x lower P95 latency and 3.8x higher throughput in multi-agent workflows compared to conventional multi-model systems. This advancement has significant implications for the development of agentic AI systems and highlights the importance of efficient caching mechanisms in multi-model inference.
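To make the reuse idea concrete, here is a minimal, hypothetical sketch of cross-model KV-cache sharing. The names (`KVCacheRegistry`, `SpecializedModel`, `get_or_build`) are illustrative, not from the paper, and the "cache" is a stand-in for real per-layer tensors; the point is only that one prefill pass can serve every specialized model when all of them produce the identical cache.

```python
# Hypothetical sketch of cross-model KV-cache reuse (names are illustrative,
# not the paper's API). One registry holds a single KV cache per prompt,
# shared by all specialized models.

class KVCacheRegistry:
    """Stores one KV cache per prompt, shared by all specialized models."""
    def __init__(self):
        self._caches = {}
        self.encode_calls = 0  # counts expensive prefill passes

    def get_or_build(self, prompt: str):
        if prompt not in self._caches:
            self.encode_calls += 1
            # Stand-in for the frozen logical encoder's prefill pass.
            self._caches[prompt] = [hash((prompt, layer)) for layer in range(4)]
        return self._caches[prompt]

class SpecializedModel:
    """A task-specific model; only its logical decoder would differ."""
    def __init__(self, name: str, registry: KVCacheRegistry):
        self.name = name
        self.registry = registry

    def generate(self, prompt: str) -> str:
        kv = self.registry.get_or_build(prompt)  # reused across models
        return f"{self.name} answered using a {len(kv)}-layer cache"

registry = KVCacheRegistry()
models = [SpecializedModel(f"agent-{i}", registry) for i in range(8)]
for m in models:
    m.generate("Summarize the meeting notes.")
print(registry.encode_calls)  # one prefill serves all eight agents
```

In a conventional multi-model system each of the eight agents would run its own prefill (eight caches, eight passes); here the counter stays at one, which is the memory and recomputation saving the paper targets.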
Key Points
- ▸ ICaRus decomposes the decoder-only Transformer into a logical encoder and decoder
- ▸ Multiple models can share an identical Key-Value (KV) cache across all layers
- ▸ ICaRus achieves comparable accuracy to task-specific fine-tuned models
- ▸ ICaRus demonstrates significant improvements in latency and throughput in multi-agent workflows
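The encoder/decoder split behind the first two points can be sketched as follows. This is a toy illustration under my own simplified assumptions (scalar "weights", a trivial decode rule), not the paper's implementation: because the logical encoder's parameters are frozen, every fine-tuned variant emits a byte-identical KV cache, and only the logical decoder differs per task.

```python
# Illustrative sketch (not the paper's implementation) of the logical
# encoder/decoder split: encoder parameters are frozen so every fine-tuned
# variant produces an identical KV cache; only decoder weights differ.

FROZEN_ENCODER = {"wk": 1.0, "wv": 2.0}  # shared, never fine-tuned

def logical_encoder(tokens):
    """Builds the KV cache; identical output for every model variant."""
    return [(t * FROZEN_ENCODER["wk"], t * FROZEN_ENCODER["wv"]) for t in tokens]

def make_logical_decoder(bias):
    """Each task-specific model fine-tunes only this part."""
    def decode(kv_cache):
        # Trivial stand-in for attention + LM head over the cache.
        return sum(k + v for k, v in kv_cache) + bias
    return decode

tokens = [1, 2, 3]
shared_cache = logical_encoder(tokens)        # computed once
model_a = make_logical_decoder(bias=0.0)      # "model A" fine-tuning
model_b = make_logical_decoder(bias=10.0)     # "model B" fine-tuning
out_a, out_b = model_a(shared_cache), model_b(shared_cache)
```

Both models read the same `shared_cache` yet produce different outputs, which is the property that lets accuracy stay task-specific while the cache stays identical.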
Merits
Efficient caching mechanism
ICaRus eliminates cache memory explosion and unexpected evictions, leading to significant improvements in latency and throughput.
Scalability
ICaRus allows multiple specialized models to fully share KV caches, making it a scalable solution for multi-model inference.
Accuracy
ICaRus achieves comparable accuracy to task-specific fine-tuned models, indicating that the proposed architecture does not compromise on performance.
Demerits
Limited evaluation
The article evaluates ICaRus only on a limited set of tasks and multi-agent workflows, which may not be representative of all possible scenarios.
Dependence on LoRA adapters
ICaRus relies on lightweight adapters such as LoRA to parallelize KV cache generation and next-token prediction during decoding, which may introduce additional complexity.
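For readers unfamiliar with the adapter mechanism this demerit refers to, here is a minimal sketch of standard LoRA math (y = Wx + B(Ax)), under the assumption that ICaRus uses it in the usual way: the frozen base projection W is shared across models, while each task keeps only the tiny low-rank pair (A, B), so the base path and the adapter path can be computed independently.

```python
# Minimal LoRA sketch (assumption: standard LoRA, y = W x + B(A x)).
# The frozen base weight W is shared; only the rank-1 pair (A, B) is
# task-specific, and the two paths are independent computations.

def matvec(M, x):
    return [sum(m * v for m, v in zip(row, x)) for row in M]

W = [[1.0, 0.0], [0.0, 1.0]]  # frozen base weight (identity here)
A = [[0.5, 0.5]]              # rank-1 down-projection (task-specific)
B = [[1.0], [1.0]]            # rank-1 up-projection (task-specific)

def lora_forward(x):
    base = matvec(W, x)                  # frozen path, shared across models
    low_rank = matvec(B, matvec(A, x))   # tiny task-specific correction
    return [b + l for b, l in zip(base, low_rank)]

y = lora_forward([2.0, 4.0])
```

The extra bookkeeping for per-task (A, B) pairs is exactly the "additional complexity" noted above, though the adapter path adds only a rank-r amount of compute per layer.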
Expert Commentary
The article ICaRus: Identical Cache Reuse for Efficient Multi Model Inference makes a significant contribution to the field of multi-model inference by addressing cache memory explosion and unexpected evictions. By decomposing the decoder-only Transformer into a logical encoder and a logical decoder, ICaRus allows multiple models to share an identical KV cache across all layers, yielding substantial improvements in latency and throughput. The article demonstrates that ICaRus can achieve accuracy comparable to task-specific fine-tuned models, indicating that the proposed architecture does not compromise on performance. The approach does have limitations, notably its limited evaluation and its dependence on LoRA adapters. Nevertheless, ICaRus has significant implications for the development of agentic AI systems and underscores the importance of efficient caching mechanisms in multi-model inference.
Recommendations
- ✓ Future work should focus on evaluating ICaRus on a more diverse set of tasks and multi-agent workflows to assess its generalizability.
- ✓ The development of efficient caching mechanisms such as ICaRus should be a priority for multi-model inference systems aiming for scalability and efficiency.