Sparse Growing Transformer: Training-Time Sparse Depth Allocation via Progressive Attention Looping
arXiv:2603.23998v1 Announce Type: new Abstract: Existing approaches to increasing the effective depth of Transformers predominantly rely on parameter reuse, extending computation through recursive execution. Under this paradigm, the network structure remains static along the training timeline, and additional computational depth is uniformly assigned to entire blocks at the parameter level. This rigidity across training time and parameter space leads to substantial computational redundancy during training. In contrast, we argue that depth allocation during training should not be a static preset, but rather a progressively growing structural process. Our systematic analysis reveals a deep-to-shallow maturation trajectory across layers, where high-entropy attention heads play a crucial role in semantic integration. Motivated by this observation, we introduce the Sparse Growing Transformer (SGT). SGT is a training-time sparse depth allocation framework that progressively extends recurrence from deeper to shallower layers via targeted attention looping on informative heads. This mechanism induces structural sparsity by selectively increasing depth only for a small subset of parameters as training evolves. Extensive experiments across multiple parameter scales demonstrate that SGT consistently outperforms training-time static block-level looping baselines under comparable settings, while reducing the additional training FLOPs overhead from approximately 16--20% to only 1--3% relative to a standard Transformer backbone.
Executive Summary
This article presents the Sparse Growing Transformer (SGT), a training-time sparse depth allocation framework for Transformers. Motivated by an analysis showing that layers mature along a deep-to-shallow trajectory, with high-entropy attention heads playing a central role in semantic integration, SGT progressively extends recurrence from deeper to shallower layers via targeted attention looping on informative heads. This induces structural sparsity: depth is increased only for a small subset of parameters as training evolves. Across experiments at multiple parameter scales, SGT consistently outperforms static block-level looping baselines under comparable settings while cutting the additional training FLOPs overhead from roughly 16--20% to only 1--3% relative to a standard Transformer backbone.
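The paper does not specify how "informative" heads are identified beyond their high attention entropy, but the idea can be sketched as follows. All names, shapes, and the top-k criterion here are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def attention_entropy(attn: np.ndarray) -> np.ndarray:
    """Mean attention entropy per head.
    attn: (batch, heads, q_len, k_len), each key-axis row summing to 1."""
    eps = 1e-9  # avoid log(0)
    ent = -(attn * np.log(attn + eps)).sum(axis=-1)   # (batch, heads, q_len)
    return ent.mean(axis=(0, 2))                      # (heads,)

def select_informative_heads(attn: np.ndarray, k: int = 2) -> np.ndarray:
    """Indices of the k highest-entropy heads (hypothetical looping candidates)."""
    scores = attention_entropy(attn)
    return np.argsort(scores)[::-1][:k]
```

A head attending uniformly over keys scores near the maximum entropy log(k_len), while a sharply peaked head scores near zero, so this criterion favors heads that integrate information broadly.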
Key Points
- ▸ SGT progressively extends recurrence from deeper to shallower layers via targeted attention looping on informative heads.
- ▸ SGT induces structural sparsity by selectively increasing depth only for a small subset of parameters as training evolves.
- ▸ SGT consistently outperforms training-time static block-level looping baselines under comparable settings.
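The deep-to-shallow growth described in the key points above can be pictured as a schedule mapping the training step to the set of layers whose informative heads are looped. The warmup fraction, linear growth rate, and function names below are illustrative assumptions; the paper's actual schedule is not reproduced here:

```python
def looped_layers(step: int, total_steps: int, n_layers: int,
                  warmup_frac: float = 0.2) -> list[int]:
    """Illustrative deep-to-shallow growth schedule: after a warmup phase,
    looping is enabled first for the deepest layer, then progressively
    for shallower layers as training advances."""
    warmup = int(warmup_frac * total_steps)
    if step < warmup:
        return []  # no extra depth early in training
    frac = (step - warmup) / max(1, total_steps - warmup)
    n_active = min(n_layers, 1 + int(frac * (n_layers - 1)))
    return list(range(n_layers - n_active, n_layers))  # deepest layers first
```

Because the set only ever grows from the deepest layer upward, each layer's looped heads get progressively more training steps the deeper the layer sits, matching the deep-to-shallow maturation trajectory the authors observe.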
Merits
Strength
The SGT framework allocates depth adaptively during training rather than fixing it in advance, improving performance while keeping computational overhead low.
Improved Performance
SGT consistently outperforms static block-level looping baselines under comparable settings, showcasing its efficacy in improving model performance.
Reduced Computational Overhead
SGT reduces additional training FLOPs overhead from approximately 16--20% to only 1--3% relative to a standard Transformer backbone.
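The rough arithmetic behind this gap is easy to check. Block-level looping re-executes entire blocks, whereas head-level looping re-executes only the attention compute of a few heads. The specific counts below (12 layers, 16 heads, attention as one third of a block's FLOPs) are illustrative assumptions, not figures from the paper:

```python
def extra_flops_fraction(total_units: int, looped_units: int) -> float:
    """Extra training FLOPs from re-executing `looped_units` out of
    `total_units` equal-cost units one additional time."""
    return looped_units / total_units

# Block-level looping: re-running 2 of 12 full blocks once more.
block_overhead = extra_flops_fraction(12, 2)  # about 16.7%

# Head-level looping: re-running only attention (assume ~1/3 of a
# block's FLOPs) for 4 of 16 heads in 2 of 12 layers (all illustrative).
head_overhead = (1 / 3) * (4 / 16) * extra_flops_fraction(12, 2)  # about 1.4%
```

Under these assumed numbers the head-level overhead lands in the reported 1--3% band while block-level looping lands in the 16--20% band, which is consistent with the paper's headline claim.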
Demerits
Limitation
The SGT framework relies on targeted attention looping on informative heads; it is not yet established that this head-selection criterion transfers across all datasets and tasks.
Complexity
The SGT mechanism adds head-selection and growth-scheduling machinery to the Transformer training loop, increasing implementation complexity relative to static looping.
Expert Commentary
The SGT framework presents a novel approach to training-time sparse depth allocation for Transformers: rather than presetting depth, it grows recurrence during training, and only on a small subset of parameters. The reported experiments support both claims of interest: better performance than static block-level looping under comparable settings, and a far smaller training FLOPs overhead. The open questions are practical ones. The head-selection and growth-scheduling machinery adds engineering complexity to the training loop, and it remains unclear how well entropy-based head selection generalizes beyond the evaluated settings. Further research is needed to explore SGT's potential in real-world scenarios.
Recommendations
- ✓ Future research should aim to explore the applicability of SGT across different datasets and tasks, as well as its potential to be combined with other sparse Transformer frameworks.
- ✓ The SGT framework presents an opportunity for further investigation into the importance of attention mechanisms in achieving improved performance, and how they can be optimized for specific tasks and applications.
Sources
Original: arXiv - cs.CL