Where Matters More Than What: Decoding-aligned KV Cache Compression via Position-aware Pseudo Queries

Zhenxu Tian, Yi Su, Juntao Li, Min Zhang

arXiv:2603.11564v1 Announce Type: new Abstract: The Key-Value (KV) cache is crucial for efficient Large Language Model (LLM) inference, but excessively long contexts drastically increase the KV cache memory footprint. Existing KV cache compression methods typically rely on input-side attention patterns within a prompt observation window to estimate token importance during the prefill stage. They fail to preserve critical tokens for future generation because these assessments are not derived from the decoding process. Intuitively, an effective observation window should mirror the decoding-stage queries to accurately reflect which tokens the generation process will attend to. However, ground-truth decoding queries are inherently unavailable during inference. When constructing pseudo queries to approximate them, we find that positional information plays a more critical role than semantic content. Motivated by this insight, we propose decoding-aligned KV cache compression via position-aware pseudo queries (DapQ), a novel and lightweight eviction framework that leverages position-aware pseudo queries to simulate the output tokens, thereby establishing an effective observation window for importance assessment. It aligns closely with the actual generation context and enables precise token eviction. Extensive evaluations across multiple benchmarks and LLMs demonstrate that DapQ achieves superior performance, particularly under strict memory constraints (e.g., up to 99.5% near-lossless performance on NIAH with a 3% KV cache budget).
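To ground the abstract's memory-footprint claim, a back-of-envelope sketch (assuming Llama-2-7B-like dimensions and fp16 storage; the `kv_cache_bytes` helper is hypothetical, not from the paper) shows why long contexts make KV cache compression attractive:

```python
def kv_cache_bytes(seq_len, num_layers=32, num_kv_heads=32,
                   head_dim=128, bytes_per_elem=2):
    """Memory for cached keys + values across all layers.

    The leading factor of 2 counts both the key and the value tensor;
    bytes_per_elem=2 assumes fp16 storage.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * seq_len

# With these dimensions each cached token costs 0.5 MiB,
# so a 32k-token context needs 16 GiB for the KV cache alone --
# a 3% budget (as in the paper's NIAH result) would cut that to ~0.5 GiB.
per_token = kv_cache_bytes(1)        # 524288 bytes = 0.5 MiB
full_ctx = kv_cache_bytes(32_768)    # 17179869184 bytes = 16 GiB
```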

Executive Summary

This arXiv article proposes a novel approach to Key-Value (KV) cache compression for Large Language Model (LLM) inference. The authors argue that existing methods rely on input-side attention patterns and therefore fail to preserve tokens that matter for future generation. To address this, they introduce decoding-aligned KV cache compression via position-aware pseudo queries (DapQ), which exploits the observation that positional information matters more than semantic content when approximating the unavailable decoding-stage queries. The pseudo queries simulate output tokens and establish an observation window for importance assessment that aligns closely with the actual generation context, enabling precise token eviction. DapQ achieves superior performance, particularly under strict memory constraints, with near-lossless results on multiple benchmarks. The implications are substantial, as efficient KV cache compression is critical for deploying LLMs over long contexts in practical applications.

Key Points

  • Existing KV cache compression methods fail to preserve critical tokens for future generation.
  • Decoding-aligned KV cache compression via position-aware pseudo queries (DapQ) is proposed as a novel approach.
  • DapQ leverages positional information to simulate output tokens and establish an effective observation window for importance assessment.
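The eviction step the key points describe can be sketched generically: score each cached token by the attention mass it receives from a window of pseudo queries, then keep only the top-budget tokens. This is a minimal single-head NumPy illustration of pseudo-query-based eviction; the `evict_kv` helper and its shapes are illustrative assumptions, and DapQ's actual position-aware construction of the pseudo queries is not reproduced here.

```python
import numpy as np

def evict_kv(keys, pseudo_queries, budget):
    """Keep the `budget` cached tokens with the highest attention mass
    under a window of pseudo queries (stand-ins for future decoding queries).

    keys:           (seq_len, head_dim) cached key vectors
    pseudo_queries: (window, head_dim)  observation-window queries
    budget:         number of tokens to retain
    Returns the retained token indices in their original order.
    """
    d = keys.shape[-1]
    logits = pseudo_queries @ keys.T / np.sqrt(d)            # (window, seq_len)
    attn = np.exp(logits - logits.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)                 # softmax per query
    scores = attn.sum(axis=0)                                # accumulate over window
    return np.sort(np.argsort(scores)[-budget:])             # top-k, original order

# Toy usage: 16 cached tokens, a 4-query observation window, 25% budget.
rng = np.random.default_rng(0)
keys = rng.normal(size=(16, 8))
pq = rng.normal(size=(4, 8))
kept = evict_kv(keys, pq, budget=4)  # indices of the 4 retained tokens
```

The design choice the paper's insight motivates is what goes into `pseudo_queries`: input-side methods fill the window with prompt-tail queries, whereas DapQ constructs queries carrying decoding-aligned positional information.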

Merits

Strength in Empirical Evaluation

The article presents extensive evaluations across multiple benchmarks and LLMs, demonstrating the superiority of DapQ, particularly under strict memory constraints.

Innovative Use of Position-aware Pseudo Queries

The authors' use of position-aware pseudo queries is a significant contribution, as it aligns closely with the actual generation context and enables precise token eviction.

Demerits

Limited Generalizability

The evaluation is confined to LLM inference, so whether the approach transfers to other domains or applications remains untested.

Computational Complexity

The computational complexity of DapQ is not thoroughly analyzed, which may limit its practical applicability in resource-constrained environments.

Expert Commentary

DapQ is a meaningful contribution to KV cache compression for LLM inference. Constructing position-aware pseudo queries to stand in for the unavailable decoding-stage queries is a novel idea, and the extensive evaluations support its effectiveness, particularly under strict memory budgets. The untested generalizability of the method and the missing computational-complexity analysis are drawbacks. Nevertheless, its contribution to KV cache compression techniques and its practical implications for deploying LLMs over long contexts make the work a valuable addition to the field.

Recommendations

  • Further research is needed to generalize the proposed method to other domains and applications.
  • A more thorough analysis of the computational complexity of DapQ is necessary to determine its practical applicability in resource-constrained environments.
