Academic

Scaling Attention via Feature Sparsity

arXiv:2603.22300v1 Announce Type: new Abstract: Scaling Transformers to ultra-long contexts is bottlenecked by the $O(n^2 d)$ cost of self-attention. Existing methods reduce this cost along the sequence axis through local windows, kernel approximations, or token-level sparsity, but these approaches consistently degrade accuracy. In this paper, we instead explore an orthogonal axis: feature sparsity. We propose Sparse Feature Attention (SFA), where queries and keys are represented as $k$-sparse codes that preserve high-dimensional expressivity while reducing the cost of attention from $\Theta(n^2 d)$ to $\Theta(n^2 k^2/d)$. To make this efficient at scale, we introduce FlashSFA, an IO-aware kernel that extends FlashAttention to operate directly on sparse overlaps without materializing dense score matrices. Across GPT-2 and Qwen3 pretraining, SFA matches dense baselines while improving speed by up to $2.5\times$ and reducing FLOPs and KV-cache by nearly 50\%. On synthetic and downstream benchmarks, SFA preserves retrieval accuracy and robustness at long contexts, outperforming short-embedding baselines that collapse feature diversity. These results establish feature-level sparsity as a complementary and underexplored axis for efficient attention, enabling Transformers to scale to orders-of-magnitude longer contexts with minimal quality loss. Code is available at https://github.com/YannX1e/Sparse-Feature-Attention.

Executive Summary

This article presents a novel approach to scaling Transformers to ultra-long contexts by introducing sparsity along the feature axis of self-attention. The proposed Sparse Feature Attention (SFA) represents queries and keys as $k$-sparse codes, reducing the attention cost from $\Theta(n^2 d)$ to $\Theta(n^2 k^2/d)$. A companion kernel, FlashSFA, extends FlashAttention to operate directly on sparse feature overlaps without materializing dense score matrices. In GPT-2 and Qwen3 pretraining, SFA matches dense baselines while improving speed by up to $2.5\times$ and cutting FLOPs and KV-cache nearly in half. At long contexts it preserves retrieval accuracy and robustness, outperforming short-embedding baselines that collapse feature diversity, establishing feature-level sparsity as a complementary axis for efficient attention.
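To make the mechanism concrete, here is a minimal NumPy sketch of attention over $k$-sparse query/key codes. The top-$k$ magnitude sparsification and function names are illustrative assumptions, not the paper's actual coding scheme, and the dense matmul is kept for clarity; an efficient kernel like FlashSFA would touch only the overlapping sparse indices.

```python
import numpy as np

def topk_sparsify(x, k):
    """Keep only the k largest-magnitude features per row.
    Hypothetical sparsifier; the paper's learned codes may differ."""
    idx = np.argsort(-np.abs(x), axis=-1)[:, :k]
    out = np.zeros_like(x)
    np.put_along_axis(out, idx, np.take_along_axis(x, idx, axis=-1), axis=-1)
    return out

def sparse_feature_attention(Q, K, V, k):
    """Softmax attention where scores come from k-sparse Q/K codes.
    Only overlapping active features contribute to each score."""
    d = Q.shape[-1]
    Qs, Ks = topk_sparsify(Q, k), topk_sparsify(K, k)
    scores = Qs @ Ks.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V
```

With `k` equal to the head dimension the sketch reduces to ordinary dense attention, which makes the dense baseline a special case of the same code path.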

Key Points

  • Sparse Feature Attention (SFA) reduces attention cost by representing queries and keys as k-sparse codes.
  • FlashSFA is an IO-aware kernel that extends FlashAttention to operate on sparse overlaps.
  • SFA improves speed by up to 2.5x and reduces FLOPs and KV-cache by nearly 50% while matching dense baselines.
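The $\Theta(n^2 k^2/d)$ score cost follows from expected index overlap: two $k$-sparse vectors with uniformly random supports in $d$ dimensions share about $k^2/d$ active features on average, so each query-key score needs roughly $k^2/d$ multiply-adds instead of $d$. A quick simulation under that uniform-support assumption (the paper's actual codes are learned, so real overlaps may differ):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, trials = 64, 8, 20000

# Sample pairs of random k-subsets of {0, ..., d-1} and count shared indices.
overlaps = [
    len(set(rng.choice(d, k, replace=False)) &
        set(rng.choice(d, k, replace=False)))
    for _ in range(trials)
]
mean_overlap = sum(overlaps) / trials  # should approach k*k/d = 1.0
```

With $d = 64$ and $k = 8$, each score costs about one multiply-add instead of 64, which is where the headline FLOP reduction comes from.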

Merits

Strength in Scalability

By cutting per-pair attention cost without dropping tokens, SFA enables Transformers to scale to ultra-long contexts with minimal quality loss, making it well suited to long-document and retrieval-heavy NLP workloads.

Demerits

Limitation in Expressivity

The k-sparse representation may limit the model's expressivity, potentially degrading performance on tasks that depend on dense, high-dimensional feature interactions not captured by the reported benchmarks.

Expert Commentary

Sparse Feature Attention (SFA) is a significant contribution to efficient attention mechanisms: by sparsifying the feature axis rather than the sequence axis, it reduces attention cost while reportedly maintaining accuracy, and the GPT-2 and Qwen3 results support its viability at ultra-long contexts. The open question is how aggressively k can be reduced before the sparse codes lose expressivity; future work should characterize that limit while preserving SFA's scalability.

Recommendations

  • Further investigation into the trade-off between expressivity and scalability in SFA is necessary to ensure its applicability in various NLP tasks.
  • Since feature-level sparsity is orthogonal to sequence-level methods (local windows, token sparsity), combining the two axes is a natural next step, as their cost savings could compound.

Sources

Original: arXiv - cs.LG