Sparse Growing Transformer: Training-Time Sparse Depth Allocation via Progressive Attention Looping
arXiv:2603.23998v1 Announce Type: new Abstract: Existing approaches to increasing the effective depth of Transformers predominantly rely on parameter reuse, extending computation through recursive execution. Under this paradigm, the network structure remains static along the training timeline, and additional computational depth is uniformly assigned to entire blocks at the parameter level. This rigidity across training time and parameter space leads to substantial computational redundancy during training. In contrast, we argue that depth allocation during training should not be a static preset, but rather a progressively growing structural process. Our systematic analysis reveals a deep-to-shallow maturation trajectory across layers, where high-entropy attention heads play a crucial role in semantic integration. Motivated by this observation, we introduce the Sparse Growing Transformer (SGT). SGT is a training-time sparse depth allocation framework that progressively extends recurrence from deeper to shallower layers via targeted attention looping on informative heads. This mechanism induces structural sparsity by selectively increasing depth only for a small subset of parameters as training evolves. Extensive experiments across multiple parameter scales demonstrate that SGT consistently outperforms training-time static block-level looping baselines under comparable settings, while reducing the additional training FLOPs overhead from approximately 16--20% to only 1--3% relative to a standard Transformer backbone.
Executive Summary
This article presents the Sparse Growing Transformer (SGT), a training-time sparse depth allocation framework for Transformers. Motivated by an analysis showing that layers mature along a deep-to-shallow trajectory, with high-entropy attention heads playing a central role in semantic integration, SGT progressively extends recurrence from deeper to shallower layers via targeted attention looping on informative heads. This induces structural sparsity: depth is increased only for a small subset of parameters as training evolves. Across experiments at multiple parameter scales, SGT consistently outperforms static block-level looping baselines under comparable settings while cutting the additional training FLOPs overhead from roughly 16--20% to only 1--3% relative to a standard Transformer backbone.
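The paper does not specify how "informative" heads are identified beyond their high attention entropy, but the idea can be sketched as follows. All names, shapes, and the top-k criterion here are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def attention_entropy(attn: np.ndarray) -> np.ndarray:
    """Mean attention entropy per head.
    attn: (batch, heads, q_len, k_len), each key-axis row summing to 1."""
    eps = 1e-9  # avoid log(0)
    ent = -(attn * np.log(attn + eps)).sum(axis=-1)   # (batch, heads, q_len)
    return ent.mean(axis=(0, 2))                      # (heads,)

def select_informative_heads(attn: np.ndarray, k: int = 2) -> np.ndarray:
    """Indices of the k highest-entropy heads (hypothetical looping candidates)."""
    scores = attention_entropy(attn)
    return np.argsort(scores)[::-1][:k]
```

A head attending uniformly over keys scores near the maximum entropy log(k_len), while a sharply peaked head scores near zero, so this criterion favors heads that integrate information broadly.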
Key Points
- ▸ SGT progressively extends recurrence from deeper to shallower layers via targeted attention looping on informative heads.
- ▸ SGT induces structural sparsity by selectively increasing depth only for a small subset of parameters as training evolves.
- ▸ SGT consistently outperforms training-time static block-level looping baselines under comparable settings.
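The deep-to-shallow growth described in the key points above can be pictured as a schedule mapping the training step to the set of layers whose informative heads are looped. The warmup fraction, linear growth rate, and function names below are illustrative assumptions; the paper's actual schedule is not reproduced here:

```python
def looped_layers(step: int, total_steps: int, n_layers: int,
                  warmup_frac: float = 0.2) -> list[int]:
    """Illustrative deep-to-shallow growth schedule: after a warmup phase,
    looping is enabled first for the deepest layer, then progressively
    for shallower layers as training advances."""
    warmup = int(warmup_frac * total_steps)
    if step < warmup:
        return []  # no extra depth early in training
    frac = (step - warmup) / max(1, total_steps - warmup)
    n_active = min(n_layers, 1 + int(frac * (n_layers - 1)))
    return list(range(n_layers - n_active, n_layers))  # deepest layers first
```

Because the set only ever grows from the deepest layer upward, each layer's looped heads get progressively more training steps the deeper the layer sits, matching the deep-to-shallow maturation trajectory the authors observe.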
Merits
Strength
The SGT framework allocates depth adaptively during training rather than fixing it in advance, improving performance while keeping computational overhead low.
Improved Performance
SGT consistently outperforms static block-level looping baselines under comparable settings, showcasing its efficacy in improving model performance.
Reduced Computational Overhead
SGT reduces additional training FLOPs overhead from approximately 16--20% to only 1--3% relative to a standard Transformer backbone.
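The rough arithmetic behind this gap is easy to check. Block-level looping re-executes entire blocks, whereas head-level looping re-executes only the attention compute of a few heads. The specific counts below (12 layers, 16 heads, attention as one third of a block's FLOPs) are illustrative assumptions, not figures from the paper:

```python
def extra_flops_fraction(total_units: int, looped_units: int) -> float:
    """Extra training FLOPs from re-executing `looped_units` out of
    `total_units` equal-cost units one additional time."""
    return looped_units / total_units

# Block-level looping: re-running 2 of 12 full blocks once more.
block_overhead = extra_flops_fraction(12, 2)  # about 16.7%

# Head-level looping: re-running only attention (assume ~1/3 of a
# block's FLOPs) for 4 of 16 heads in 2 of 12 layers (all illustrative).
head_overhead = (1 / 3) * (4 / 16) * extra_flops_fraction(12, 2)  # about 1.4%
```

Under these assumed numbers the head-level overhead lands in the reported 1--3% band while block-level looping lands in the 16--20% band, which is consistent with the paper's headline claim.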
Demerits
Limitation
The SGT framework relies on targeted attention looping on informative heads; it is not yet established that this head-selection criterion transfers across all datasets and tasks.
Complexity
The SGT mechanism adds head-selection and growth-scheduling machinery to the Transformer training loop, increasing implementation complexity relative to static looping.
Expert Commentary
The SGT framework presents a novel approach to training-time sparse depth allocation for Transformers: rather than presetting depth, it grows recurrence during training, and only on a small subset of parameters. The reported experiments support both claims of interest: better performance than static block-level looping under comparable settings, and a far smaller training FLOPs overhead. The open questions are practical ones. The head-selection and growth-scheduling machinery adds engineering complexity to the training loop, and it remains unclear how well entropy-based head selection generalizes beyond the evaluated settings. Further research is needed to explore SGT's potential in real-world scenarios.
Recommendations
- ✓ Future research should aim to explore the applicability of SGT across different datasets and tasks, as well as its potential to be combined with other sparse Transformer frameworks.
- ✓ The SGT framework presents an opportunity for further investigation into the importance of attention mechanisms in achieving improved performance, and how they can be optimized for specific tasks and applications.
Sources
Original: arXiv - cs.CL