RelayCaching: Accelerating LLM Collaboration via Decoding KV Cache Reuse
arXiv:2603.13289v1 Announce Type: new Abstract: The increasing complexity of AI tasks has shifted the paradigm from monolithic models toward multi-agent large language model (LLM) systems. However, these collaborative architectures introduce a critical bottleneck: redundant prefill computation for shared content generated by previous agents, which significantly increases KV cache memory usage and time-to-first-token (TTFT). While various KV cache methods have been proposed to mitigate prefill redundancy, they either fail to maintain accuracy on agent-generated outputs or exhibit low reuse rates due to rigid constraints. We present RelayCaching, a training-free inference method that directly reuses decoding-phase KV caches from previous agents in subsequent prefill phases. Our key insight is that KV caches for identical content are highly consistent across phases, while prefix-induced deviations are sparse and localized within a limited range of layers and token positions. By selectively recomputing KV caches at these positions, RelayCaching preserves model accuracy with minimal overhead, yielding a superior accuracy-efficiency trade-off over existing methods. Experiments on diverse collaborative LLM tasks spanning mathematical reasoning, general knowledge, and code generation demonstrate that RelayCaching achieves over 80% KV cache reuse and reduces TTFT by up to $4.7\times$ compared to the standard pipeline, all with negligible accuracy degradation.
Executive Summary
The article introduces RelayCaching, a training-free method for accelerating large language model (LLM) collaboration by reusing decoding-phase KV caches from previous agents. This approach addresses the bottleneck of redundant prefill computation for shared content, reducing time-to-first-token (TTFT) by up to 4.7 times while maintaining model accuracy. With over 80% KV cache reuse, RelayCaching offers a favorable accuracy-efficiency trade-off for collaborative LLM tasks.
Key Points
- ▸ RelayCaching reuses decoding-phase KV caches from previous agents to eliminate redundant prefill computation
- ▸ The method achieves over 80% KV cache reuse and reduces TTFT by up to 4.7 times
- ▸ RelayCaching preserves model accuracy with minimal overhead, yielding a superior accuracy-efficiency trade-off
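The core mechanism described in the key points above can be illustrated with a toy sketch. All names and the deviation set here are hypothetical; the actual method identifies prefix-induced deviations empirically, observing that they are sparse and confined to a limited range of layers and token positions.

```python
# Toy illustration of the selective-recomputation idea behind RelayCaching.
# The cache layout, the recompute hook, and the deviation set are all
# illustrative assumptions, not the paper's actual implementation.

def relay_cache(decode_kv, recompute_fn, deviation_positions):
    """Reuse decoding-phase KV entries, recomputing only flagged positions.

    decode_kv: dict mapping (layer, token_pos) -> KV entry saved while a
        previous agent was decoding its output.
    recompute_fn: computes a fresh KV entry for (layer, token_pos) under
        the next agent's new prefix.
    deviation_positions: (layer, token_pos) pairs whose cached entries are
        expected to deviate once the content sits behind a new prefix.
    """
    merged = {}
    recomputed = 0
    for key, entry in decode_kv.items():
        if key in deviation_positions:
            merged[key] = recompute_fn(*key)   # sparse, localized recompute
            recomputed += 1
        else:
            merged[key] = entry                # direct reuse, no prefill cost
    reuse_rate = 1 - recomputed / len(decode_kv)
    return merged, reuse_rate

# Example: 4 layers x 10 tokens, with deviations confined to the first two
# token positions of layer 0 (sparse and localized, as the paper observes).
decode_kv = {(l, t): f"kv[{l},{t}]" for l in range(4) for t in range(10)}
deviations = {(0, 0), (0, 1)}
merged, reuse = relay_cache(decode_kv, lambda l, t: f"fresh[{l},{t}]", deviations)
print(f"reuse rate: {reuse:.0%}")  # 38 of 40 entries reused -> 95%
```

The subsequent agent then prefills only the recomputed positions instead of re-encoding the entire shared content, which is where the TTFT savings come from.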
Merits
Improved Efficiency
RelayCaching significantly reduces the computational overhead associated with redundant prefill computation, making it an attractive solution for collaborative LLM tasks.
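A rough back-of-envelope calculation (not from the paper) shows why the reported reuse rate and TTFT reduction are mutually consistent: if prefill cost were proportional to the number of tokens actually computed, reusing a fraction r of the cache would bound the speedup at 1/(1-r).

```python
# Illustrative only: assumes TTFT is dominated by prefill compute that
# scales linearly with the number of tokens recomputed.
reuse = 0.80                      # reported cache-reuse rate (>80%)
ideal_speedup = 1 / (1 - reuse)   # upper bound under this toy model
print(ideal_speedup)              # 5.0
```

An ideal bound of 5.0x under this simplistic model is compatible with the observed up-to-4.7x TTFT reduction once recomputation and bookkeeping overheads are accounted for.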
Demerits
Limited Generalizability
The effectiveness of RelayCaching may be limited to specific types of collaborative LLM tasks, and its performance on other tasks is uncertain.
Expert Commentary
The introduction of RelayCaching marks a notable advance for collaborative LLM systems. By addressing the bottleneck of redundant prefill computation, it has the potential to accelerate a wide range of multi-agent applications, from mathematical reasoning to code generation. However, further research is needed to establish how well the approach generalizes beyond the evaluated tasks, for instance across model families and pipeline topologies. As multi-agent LLM systems continue to evolve, training-free inference optimizations of this kind will become increasingly important.
Recommendations
- ✓ Further research should be conducted to explore the limitations and potential applications of RelayCaching
- ✓ The development of RelayCaching should be considered in the context of broader efforts to optimize LLM systems and improve their overall performance.