Large Spikes in Stochastic Gradient Descent: A Large-Deviations View
arXiv:2603.10079v1. Abstract: We analyse SGD training of a shallow, fully connected network in the NTK scaling and provide a quantitative theory of the catapult phase. We identify an explicit criterion separating two behaviours: when an explicit function $G$, depending only on the kernel, learning rate $\eta$ and data, is positive, SGD produces large NTK-flattening spikes with high probability; when $G<0$, their probability decays like $(n/\eta)^{-\vartheta/2}$, for an explicitly characterised $\vartheta\in (0,\infty)$. This yields a concrete parameter-dependent explanation for why such spikes may still be observed at practical widths.
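Restated in display form (a paraphrase of the abstract, not the paper's exact statement: $n$ is read here as the network width, following the remark about practical widths, and $C$, $T$ are illustrative placeholders):

$$\mathbb{P}\big(\text{large NTK-flattening spike within } T \text{ steps}\big) \longrightarrow 1 \quad \text{as } n \to \infty, \qquad \text{if } G > 0,$$

$$\mathbb{P}\big(\text{large NTK-flattening spike within } T \text{ steps}\big) \;\le\; C\,(n/\eta)^{-\vartheta/2}, \quad \vartheta \in (0,\infty), \qquad \text{if } G < 0.$$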
Executive Summary
The article develops a quantitative theory of the catapult phase in Stochastic Gradient Descent (SGD) training of a shallow, fully connected network in the NTK scaling. It identifies an explicit criterion, a function $G$ depending only on the kernel, the learning rate $\eta$, and the data, that separates two regimes: when $G > 0$, SGD produces large NTK-flattening spikes with high probability; when $G < 0$, the spike probability decays like $(n/\eta)^{-\vartheta/2}$ for an explicitly characterised exponent $\vartheta \in (0,\infty)$. Because this decay is only polynomial in the width, the theory gives a concrete, parameter-dependent explanation for why such spikes may still be observed at practical widths, and the large-deviations viewpoint sheds light on the mechanisms driving these spikes during training.
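The criterion $G$ itself is not reproduced in this digest, but the qualitative dichotomy is easy to probe numerically. Below is a minimal sketch, assuming a two-layer ReLU network in the NTK scaling with a frozen second layer, synthetic Gaussian data, and a crude spike detector on the training loss; the learning rate `eta` is simply set large to stand in for the $G > 0$ regime (it is not the paper's criterion, may need tuning per seed, and too-large values can diverge outright).

```python
# Minimal sketch: SGD on a width-m two-layer ReLU net in NTK scaling,
# flagging large training-loss spikes. All constants are illustrative.
import numpy as np

rng = np.random.default_rng(0)

m, d, N = 512, 10, 64            # width, input dimension, sample count
eta, steps, batch = 4.0, 400, 8  # eta chosen large to provoke spikes (assumption)

X = rng.standard_normal((N, d)) / np.sqrt(d)   # roughly unit-norm inputs
y = rng.standard_normal(N)

W = rng.standard_normal((m, d))     # first layer (trained)
a = rng.choice([-1.0, 1.0], m)      # second layer (frozen +/-1 signs)

def f(Xb, W):
    # NTK scaling: output normalised by 1/sqrt(m)
    return (np.maximum(Xb @ W.T, 0.0) @ a) / np.sqrt(m)

losses = []
for t in range(steps):
    idx = rng.choice(N, batch, replace=False)
    Xb, yb = X[idx], y[idx]
    pre = Xb @ W.T                   # (batch, m) pre-activations
    err = f(Xb, W) - yb              # residuals on the mini-batch
    # gradient of 0.5 * mean squared error w.r.t. W
    grad = ((err[:, None] * (pre > 0) * a[None, :]).T @ Xb) / (batch * np.sqrt(m))
    W -= eta * grad
    losses.append(0.5 * np.mean((f(X, W) - y) ** 2))

losses = np.array(losses)
# Flag "spikes": steps where the loss exceeds 5x its trailing median.
med = np.array([np.median(losses[max(0, t - 50):t + 1]) for t in range(steps)])
spikes = np.where(losses > 5.0 * med)[0]
print(f"final loss {losses[-1]:.4f}, spike steps: {spikes[:10]}")
```

Rerunning with a small `eta` (the $G < 0$ side of the dichotomy) should leave the spike list empty on most seeds, mirroring the paper's polynomially decaying bound.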
Key Points
- ▸ Quantitative, large-deviations theory of the catapult phase in SGD training of shallow networks in the NTK scaling
- ▸ Explicit criterion $G$ (depending only on the kernel, learning rate $\eta$, and data) separating a high-probability spike regime ($G > 0$) from one with polynomially decaying spike probability ($G < 0$)
- ▸ Parameter-dependent explanation for why large NTK-flattening spikes persist at practical widths
Merits
Rigorous Mathematical Framework
The article provides a rigorous mathematical framework, grounded in large-deviations theory, for analysing spike events in SGD training: an explicit criterion $G$ and an explicit decay exponent $\vartheta$ rather than purely qualitative observations.
Demerits
Limited Applicability
The analysis is restricted to a shallow, fully connected network in the NTK scaling, so it is unclear how far the criterion $G$ and the exponent $\vartheta$ carry over to deep architectures, other parameterisations, or richer training regimes.
Expert Commentary
The article makes a significant contribution to the understanding of SGD training: rather than merely observing catapult-style spikes, it gives a sharp, checkable criterion for when they occur. The explicit function $G$, depending only on the kernel, the learning rate $\eta$, and the data, cleanly separates the high-probability spike regime from one in which the spike probability decays at the explicit polynomial rate $(n/\eta)^{-\vartheta/2}$; since this decay is slow at moderate widths, the theory also explains why spikes persist in practice. Further research is needed to test how far these conclusions extend beyond shallow, fully connected networks in the NTK scaling.
Recommendations
- ✓ Investigate whether the criterion $G$ and the exponent $\vartheta$ extend to deeper architectures and more general training settings
- ✓ Explore applications of the spike criterion in the design of optimization algorithms, e.g. learning-rate schedules that avoid (or deliberately exploit) the spike regime