Beyond Token Eviction: Mixed-Dimension Budget Allocation for Efficient KV Cache Compression

arXiv:2603.20616v1 Announce Type: new Abstract: Key-value (KV) caching is widely used to accelerate transformer inference, but its memory cost grows linearly with input length, limiting long-context deployment. Existing token eviction methods reduce memory by discarding less important tokens, which can be viewed as a coarse form of dimensionality reduction that assigns each token either zero or full dimension. We propose MixedDimKV, a mixed-dimension KV cache compression method that allocates dimensions to tokens at a more granular level, and MixedDimKV-H, which further integrates head-level importance information. Experiments on long-context benchmarks show that MixedDimKV outperforms prior KV cache compression methods that do not rely on head-level importance profiling. When equipped with the same head-level importance information, MixedDimKV-H consistently outperforms HeadKV. Notably, our approach achieves comparable performance to full attention on LongBench with only 6.25% of the KV cache. Furthermore, in the Needle-in-a-Haystack test, our solution maintains 100% accuracy at a 50K context length while using as little as 0.26% of the cache.
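The core idea, assigning each token a fractional dimension budget rather than a binary keep/evict decision, can be sketched in a few lines. This is an illustrative sketch, not the paper's algorithm: the importance scores, the proportional rounding rule, and the channel-selection step (here, simply keeping the first channels) are all assumptions for exposition.

```python
import random

def allocate_dimensions(importance, head_dim, budget):
    """Split a total dimension budget across tokens in proportion to
    their importance scores. Token eviction is the special case where
    every token gets either 0 or head_dim dimensions; mixed-dimension
    allocation permits values in between."""
    total = sum(importance)
    return [min(head_dim, max(0, round(s / total * budget)))
            for s in importance]

def compress_kv(keys, values, dims):
    """Keep the first dims[t] channels of each token's key/value vectors.
    Illustrative only: a real method would choose which channels to keep
    (e.g. by magnitude or a learned projection)."""
    return [(k[:d], v[:d]) for k, v, d in zip(keys, values, dims)]

random.seed(0)
n_tokens, head_dim = 8, 64
importance = [random.random() for _ in range(n_tokens)]  # stand-in scores
budget = round(0.25 * n_tokens * head_dim)  # keep 25% of the full cache
dims = allocate_dimensions(importance, head_dim, budget)
keys = [[random.gauss(0, 1) for _ in range(head_dim)] for _ in range(n_tokens)]
values = [[random.gauss(0, 1) for _ in range(head_dim)] for _ in range(n_tokens)]
cache = compress_kv(keys, values, dims)
```

Under this scheme, important tokens retain most of their channels while unimportant ones shrink toward zero, and the total number of stored dimensions stays close to the budget up to rounding.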

Executive Summary

This article covers MixedDimKV, a new approach to key-value (KV) cache compression for transformer inference. Instead of evicting whole tokens, MixedDimKV allocates cache dimensions to tokens at a finer granularity; a variant, MixedDimKV-H, additionally incorporates head-level importance information. On long-context benchmarks, MixedDimKV outperforms prior compression methods that do not use head-level profiling, and MixedDimKV-H consistently outperforms HeadKV when given the same head-level information. The approach matches full attention on LongBench with only 6.25% of the KV cache and, in the Needle-in-a-Haystack test, maintains 100% accuracy at a 50K context length with as little as 0.26% of the cache. These results are significant for deploying long-context transformer models under tight memory budgets.

Key Points

  • MixedDimKV is a novel approach to key-value cache compression that allocates dimensions to tokens at a granular level.
  • MixedDimKV outperforms prior KV cache compression methods that do not rely on head-level importance profiling; its variant MixedDimKV-H, given the same head-level information, consistently outperforms HeadKV.
  • The approach achieves significant memory reduction, matching full attention on LongBench with only 6.25% of the KV cache and maintaining 100% Needle-in-a-Haystack accuracy at 50K context with as little as 0.26%.
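To make those budget figures concrete, here is a back-of-the-envelope memory calculation. The model configuration (32 layers, 8 KV heads, head dimension 128, fp16) is an assumption chosen for illustration, not the model evaluated in the paper.

```python
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128,
                   bytes_per_value=2):
    """Memory for a standard KV cache: keys plus values (factor of 2),
    one head_dim vector per token, per KV head, per layer, in fp16."""
    return 2 * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_value

full = kv_cache_bytes(50_000)
print(f"full cache at 50K tokens: {full / 2**30:.2f} GiB")
print(f"6.25% budget (LongBench): {0.0625 * full / 2**30:.2f} GiB")
print(f"0.26% budget (NIAH):      {0.0026 * full / 2**30:.3f} GiB")
```

Even under these modest assumptions, a 50K-token cache occupies several GiB, which is why sub-10% compression budgets matter for serving long contexts.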

Merits

Improved Efficiency

MixedDimKV achieves efficient cache compression while maintaining performance comparable to full attention, making it an attractive solution for long-context transformer models.

Scalability

The approach demonstrates significant memory reduction, allowing large transformer models to be deployed on resource-constrained devices.

Flexibility

Because mixed-dimension allocation generalizes token eviction, MixedDimKV can in principle be combined with existing KV cache compression techniques, such as head-level importance profiling (as in MixedDimKV-H), making it a flexible solution for a range of applications.

Demerits

Complexity

The approach requires a deep understanding of KV cache compression and transformer inference, which may pose a barrier to adoption for some researchers and practitioners.

Computational Overhead

The integration of head-level importance information may introduce additional computational overhead, which needs to be carefully evaluated in practice.

Expert Commentary

MixedDimKV is a meaningful contribution to transformer inference, addressing the memory cost of the KV cache that limits long-context deployment. Framing token eviction as a coarse, all-or-nothing form of dimensionality reduction, and then relaxing it to granular per-token dimension budgets, is an elegant generalization that recovers near-full-attention quality at a fraction of the memory. As with any novel approach, its added complexity and the computational overhead of head-level importance profiling need to be evaluated carefully in practice. Overall, MixedDimKV is a promising direction for the efficient deployment of long-context transformer models, with substantial implications for the broader field of natural language processing.

Recommendations

  • Further research is needed to evaluate the performance of MixedDimKV on a wider range of applications and resource-constrained devices.
  • The approach should be integrated with existing KV cache compression methods to provide a more comprehensive solution for transformer inference.

Sources

Original: arXiv - cs.LG