Large Spikes in Stochastic Gradient Descent: A Large-Deviations View
arXiv:2603.10079v1. Abstract: We analyse SGD training of a shallow, fully connected network in the NTK scaling and provide a quantitative theory of the catapult phase. We identify an explicit criterion separating two behaviours: when an explicit function $G$, depending only on the kernel, learning rate $\eta$ and data, is positive, SGD produces large NTK-flattening spikes with high probability; when $G<0$, their probability decays like $(n/\eta)^{-\vartheta/2}$, for an explicitly characterised $\vartheta\in (0,\infty)$. This yields a concrete parameter-dependent explanation for why such spikes may still be observed at practical widths.
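Restated in display form (a paraphrase of the abstract, not the paper's exact statement: $n$ is read here as the network width, following the remark about practical widths, and $C$, $T$ are illustrative placeholders):

$$\mathbb{P}\big(\text{large NTK-flattening spike within } T \text{ steps}\big) \longrightarrow 1 \quad \text{as } n \to \infty, \qquad \text{if } G > 0,$$

$$\mathbb{P}\big(\text{large NTK-flattening spike within } T \text{ steps}\big) \;\le\; C\,(n/\eta)^{-\vartheta/2}, \quad \vartheta \in (0,\infty), \qquad \text{if } G < 0.$$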
Executive Summary
The article develops a quantitative theory of the catapult phase in Stochastic Gradient Descent (SGD) training of a shallow, fully connected network in the NTK scaling. It identifies an explicit criterion, a function $G$ depending only on the kernel, the learning rate $\eta$, and the data, that separates two regimes: when $G > 0$, SGD produces large NTK-flattening spikes with high probability; when $G < 0$, the spike probability decays like $(n/\eta)^{-\vartheta/2}$ for an explicitly characterised exponent $\vartheta \in (0,\infty)$. Because this decay is only polynomial in the width, the theory gives a concrete, parameter-dependent explanation for why such spikes may still be observed at practical widths, and the large-deviations viewpoint sheds light on the mechanisms driving these spikes during training.
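The criterion $G$ itself is not reproduced in this digest, but the qualitative dichotomy is easy to probe numerically. Below is a minimal sketch, assuming a two-layer ReLU network in the NTK scaling with a frozen second layer, synthetic Gaussian data, and a crude spike detector on the training loss; the learning rate `eta` is simply set large to stand in for the $G > 0$ regime (it is not the paper's criterion, may need tuning per seed, and too-large values can diverge outright).

```python
# Minimal sketch: SGD on a width-m two-layer ReLU net in NTK scaling,
# flagging large training-loss spikes. All constants are illustrative.
import numpy as np

rng = np.random.default_rng(0)

m, d, N = 512, 10, 64            # width, input dimension, sample count
eta, steps, batch = 4.0, 400, 8  # eta chosen large to provoke spikes (assumption)

X = rng.standard_normal((N, d)) / np.sqrt(d)   # roughly unit-norm inputs
y = rng.standard_normal(N)

W = rng.standard_normal((m, d))     # first layer (trained)
a = rng.choice([-1.0, 1.0], m)      # second layer (frozen +/-1 signs)

def f(Xb, W):
    # NTK scaling: output normalised by 1/sqrt(m)
    return (np.maximum(Xb @ W.T, 0.0) @ a) / np.sqrt(m)

losses = []
for t in range(steps):
    idx = rng.choice(N, batch, replace=False)
    Xb, yb = X[idx], y[idx]
    pre = Xb @ W.T                   # (batch, m) pre-activations
    err = f(Xb, W) - yb              # residuals on the mini-batch
    # gradient of 0.5 * mean squared error w.r.t. W
    grad = ((err[:, None] * (pre > 0) * a[None, :]).T @ Xb) / (batch * np.sqrt(m))
    W -= eta * grad
    losses.append(0.5 * np.mean((f(X, W) - y) ** 2))

losses = np.array(losses)
# Flag "spikes": steps where the loss exceeds 5x its trailing median.
med = np.array([np.median(losses[max(0, t - 50):t + 1]) for t in range(steps)])
spikes = np.where(losses > 5.0 * med)[0]
print(f"final loss {losses[-1]:.4f}, spike steps: {spikes[:10]}")
```

Rerunning with a small `eta` (the $G < 0$ side of the dichotomy) should leave the spike list empty on most seeds, mirroring the paper's polynomially decaying bound.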
Key Points
- ▸ Quantitative, large-deviations theory of the catapult phase in SGD training of shallow networks in the NTK scaling
- ▸ Explicit criterion $G$ (depending only on the kernel, learning rate $\eta$, and data) separating a high-probability spike regime ($G > 0$) from one with polynomially decaying spike probability ($G < 0$)
- ▸ Parameter-dependent explanation for why large NTK-flattening spikes persist at practical widths
Merits
Rigorous Mathematical Framework
The article provides a rigorous mathematical framework, grounded in large-deviations theory, for analysing spike events in SGD training: an explicit criterion $G$ and an explicit decay exponent $\vartheta$ rather than purely qualitative observations.
Demerits
Limited Applicability
The analysis is restricted to a shallow, fully connected network in the NTK scaling, so it is unclear how far the criterion $G$ and the exponent $\vartheta$ carry over to deep architectures, other parameterisations, or richer training regimes.
Expert Commentary
The article makes a significant contribution to the understanding of SGD training: rather than merely observing catapult-style spikes, it gives a sharp, checkable criterion for when they occur. The explicit function $G$, depending only on the kernel, the learning rate $\eta$, and the data, cleanly separates the high-probability spike regime from one in which the spike probability decays at the explicit polynomial rate $(n/\eta)^{-\vartheta/2}$; since this decay is slow at moderate widths, the theory also explains why spikes persist in practice. Further research is needed to test how far these conclusions extend beyond shallow, fully connected networks in the NTK scaling.
Recommendations
- ✓ Investigate whether the criterion $G$ and the exponent $\vartheta$ extend to deeper architectures and more general training settings
- ✓ Explore applications of the spike criterion in the design of optimization algorithms, e.g. learning-rate schedules that avoid (or deliberately exploit) the spike regime