Coupled Query-Key Dynamics for Attention
arXiv:2604.01683v1 Announce Type: new
Abstract: Standard scaled dot-product attention computes scores from static, independent projections of the input. We show that evolving queries and keys jointly through shared learned dynamics before scoring, which we call coupled QK dynamics, improves language modeling perplexity and training stability. On WikiText-103 at 60M parameters, coupled dynamics achieves 22.55–22.62 perplexity vs. 24.22 for standard attention (−6.6 to −6.9%), with only 0.11% additional parameters (shared across both instantiations). A structural ablation isolates coupling as the active ingredient: a symplectic (Hamiltonian) and a non-symplectic (Euler) integrator perform identically when both couple Q and K, while an uncoupled MLP baseline of matched capacity reaches only 23.81 with 8× higher seed variance. The integration step count (1–7) is similarly irrelevant: a single coupled step suffices. A compute-matched comparison reveals that coupling is a sample-efficiency mechanism: standard attention trained for 2.4× longer (matching wall-clock) reaches the same perplexity, but requires 2.4× more tokens. The advantage scales to 150M (−6.7%) but narrows at 350M (−1.0%), where Differential Attention (18.93) overtakes coupled dynamics (19.35). The benefit is corpus-dependent: coupling helps on domain-coherent text (WikiText-103 −6.6%, PubMed −4.5%) but degrades on heterogeneous web text (+10.3%) and shows no benefit on GLUE. We characterize when coupling helps and when it does not, providing practical guidelines.
Executive Summary
This article presents a novel approach to attention mechanisms in language modeling, dubbed Coupled Query-Key Dynamics (CQKD). By evolving queries and keys jointly through shared learned dynamics before scoring, CQKD improves language modeling perplexity and training stability. The study demonstrates significant gains on WikiText-103, with a 6.6–6.9% reduction in perplexity over standard attention at 60M parameters; the advantage persists at 150M but narrows at 350M. However, the benefit of CQKD is corpus-dependent: it degrades on heterogeneous web text and shows no benefit on GLUE. The authors provide practical guidelines on when CQKD is beneficial and when it is not. Overall, CQKD is a promising innovation in language modeling, with potential applications across NLP tasks.
Key Points
- ▸ CQKD introduces shared learned dynamics between queries and keys before scoring
- ▸ CQKD improves language modeling perplexity and training stability
- ▸ The benefit of CQKD is corpus-dependent and degrades on heterogeneous web text
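The mechanism in the first bullet can be sketched concretely. The snippet below is a minimal illustration, not the paper's implementation: a single shared weight matrix stands in for the learned dynamics, and one explicit Euler step couples Q and K by making each update depend on the other (the abstract reports that a single coupled step suffices). The function names, the `tanh` nonlinearity, and the toy dimensions are all assumptions.

```python
import numpy as np

def coupled_qk_step(q, k, w_shared, dt=1.0):
    # One Euler step of (hypothetical) coupled dynamics: q's update
    # reads k, and k's update reads q, through the SAME shared
    # parameters -- this cross-dependence is the coupling.
    dq = np.tanh(k @ w_shared)
    dk = np.tanh(q @ w_shared)
    return q + dt * dq, k + dt * dk

def attention_scores(q, k):
    # Standard scaled dot-product scores.
    return (q @ k.T) / np.sqrt(q.shape[-1])

rng = np.random.default_rng(0)
seq_len, d = 4, 8
q = rng.standard_normal((seq_len, d))
k = rng.standard_normal((seq_len, d))
w_shared = 0.1 * rng.standard_normal((d, d))  # one matrix shared across Q and K

q1, k1 = coupled_qk_step(q, k, w_shared)  # single coupled step before scoring
scores = attention_scores(q1, k1)
print(scores.shape)  # (4, 4)
```

Note that with `w_shared` fixed at zero the step is the identity, so everything beyond standard attention flows through the small shared parameter block, consistent with the 0.11% parameter overhead the abstract reports.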
Merits
Strength in Language Modeling
CQKD demonstrates significant gains on WikiText-103, with a 6.6–6.9% reduction in perplexity compared to standard attention (22.55–22.62 vs. 24.22 at 60M parameters).
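The quoted percentage range follows directly from the abstract's perplexity figures; a quick arithmetic check:

```python
# Perplexity figures copied from the abstract (WikiText-103, 60M params).
baseline = 24.22            # standard attention
coupled = (22.55, 22.62)    # coupled QK dynamics, range across seeds

for ppl in coupled:
    reduction = (baseline - ppl) / baseline * 100
    print(f"{ppl}: -{reduction:.1f}%")
# 22.55: -6.9%
# 22.62: -6.6%
```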
Improved Training Stability
CQKD improves training stability: an uncoupled baseline of matched capacity shows 8× higher seed variance, so coupling yields more reproducible training runs.
Demerits
Corpus-Dependent Benefit
The benefit of CQKD is corpus-dependent: it degrades on heterogeneous web text (+10.3% perplexity) and shows no benefit on GLUE.
Additional Computational Cost
The shared learned dynamics add compute to every attention call; in the compute-matched comparison, standard attention trained 2.4× longer matches CQKD's perplexity, so the gain is in sample efficiency rather than final quality per unit of compute.
Expert Commentary
While CQKD is a promising innovation in language modeling, its corpus-dependent benefit and additional computational cost must be weighed carefully. The study provides valuable insight into the strengths and limitations of CQKD, along with practical guidelines for when it helps and when it does not. As NLP continues to evolve, CQKD may become a valuable tool for building more effective language models, but further research is needed to fully understand its applications and limits.
Recommendations
- ✓ Further research is needed to explore the potential applications of CQKD in various NLP tasks
- ✓ The development of more efficient and scalable versions of CQKD could help to overcome its additional computational cost
Sources
Original: arXiv - cs.LG