Deriving Hyperparameter Scaling Laws via Modern Optimization Theory
arXiv:2603.15958v1
Abstract: Hyperparameter transfer has become an important component of modern large-scale training recipes. Existing methods, such as muP, primarily focus on transfer between model sizes, with transfer across batch sizes and training horizons often relying on empirical scaling rules informed by insights from timescale preservation, quadratic proxies, and continuous-time approximations. We study hyperparameter scaling laws for modern first-order optimizers through the lens of recent convergence bounds for methods based on the Linear Minimization Oracle (LMO), a framework that includes normalized SGD, signSGD (approximating Adam), and Muon. Treating bounds in recent literature as a proxy and minimizing them across different tuning regimes yields closed-form power-law schedules for learning rate, momentum, and batch size as functions of the iteration or token budget. Our analysis, holding model size fixed, recovers most insights and observations from the literature under a unified and principled perspective, with clear directions open for future research. Our results draw particular attention to the interaction between momentum and batch-size scaling, suggesting that optimal performance may be achieved with several scaling strategies.
Executive Summary
The paper derives closed-form power-law schedules for learning rate, momentum, and batch size by minimizing recent convergence bounds for methods built on the Linear Minimization Oracle (LMO), a family that covers normalized SGD, signSGD (as a proxy for Adam), and Muon. Treating these bounds as a tuning objective across different regimes, and holding model size fixed, the authors recover most empirical scaling rules from the literature within a single principled framework. The analysis pays particular attention to the interaction between momentum and batch-size scaling, suggesting that near-optimal performance can be reached with several distinct scaling strategies, and it leaves clear directions open for future research.
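To make the bound-minimization recipe concrete, here is a minimal sketch in Python. The function `proxy_bound`, its constants `sigma`, `L`, `D`, and the specific functional form are illustrative assumptions, not the bound analyzed in the paper; the sketch only shows that minimizing such a bound over learning rate and momentum at each iteration budget T yields schedules that follow power laws, which can be read off from a log-log fit.

```python
# Hedged sketch: minimize an *assumed* convergence-bound proxy over learning rate
# eta and momentum beta for several iteration budgets T, then fit power-law
# exponents on a log-log scale. The bound below is illustrative only.
import numpy as np
from scipy.optimize import minimize

def proxy_bound(params, T, sigma=1.0, L=1.0, D=1.0):
    """Assumed upper bound: progress term + curvature term + momentum-damped noise."""
    log_eta, logit_beta = params
    eta = np.exp(log_eta)                      # keep eta > 0
    beta = 1.0 / (1.0 + np.exp(-logit_beta))   # keep beta in (0, 1)
    return (D / (eta * T) + L * eta
            + sigma * eta / np.sqrt(T * (1.0 - beta))
            + sigma * (1.0 - beta))

budgets = np.logspace(2, 6, 9)  # iteration budgets T
etas, betas = [], []
x0 = np.array([-3.0, 2.0])      # initial guess in (log_eta, logit_beta) space
for T in budgets:
    res = minimize(proxy_bound, x0=x0, args=(T,), method="Nelder-Mead")
    x0 = res.x                  # warm start the next budget
    etas.append(np.exp(res.x[0]))
    betas.append(1.0 / (1.0 + np.exp(-res.x[1])))

# Fit eta*(T) ~ T^a and (1 - beta*(T)) ~ T^b on a log-log scale.
a = np.polyfit(np.log(budgets), np.log(etas), 1)[0]
b = np.polyfit(np.log(budgets), np.log(1.0 - np.array(betas)), 1)[0]
print(f"learning-rate exponent ~ {a:.2f}, momentum-gap exponent ~ {b:.2f}")
```

Under this particular assumed bound, the fitted exponents come out near -1/2 for the learning rate and roughly -2/3 for the momentum gap; the exponents reported in the paper depend on the actual LMO-based bounds being minimized.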
Key Points
- ▸ Derivation of closed-form power-law schedules for learning rate, momentum, and batch size
- ▸ Unified, principled framework for hyperparameter scaling laws built on LMO-based convergence bounds
- ▸ Identification of the interaction between momentum and batch-size scaling as a key driver of optimal performance
Merits
Strength in Theoretical Foundation
The article builds on recent convergence bounds for methods based on LMO, providing a strong theoretical foundation for the derived hyperparameter scaling laws.
Practical Implications
The results offer clear directions for optimizing hyperparameters in large-scale training recipes, with potential applications in deep learning and optimization.
Demerits
Limited Empirical Validation
The study relies on theoretical analysis and proxy bounds, which may not directly translate to real-world applications without further empirical validation.
Assumptions and Simplifications
The authors assume a fixed model size and neglect certain complexities, such as non-convexity and non-smoothness, which may limit the generalizability of the results.
Expert Commentary
While the article makes a meaningful contribution to hyperparameter transfer, its limitations should be kept in mind. Because the schedules are obtained by minimizing proxy bounds rather than measured training curves, the tightness of those bounds, and hence the derived exponents, may not carry over directly to practice. The simplifying assumptions in the analysis, such as holding model size fixed, also narrow its scope relative to full large-scale training recipes. Nevertheless, the study provides a valuable, unified framework for reasoning about hyperparameter scaling laws and points to clear directions for future research.
Recommendations
- ✓ Future studies should aim to empirically validate the derived power-law schedules and investigate their generalizability to more complex scenarios.
- ✓ Researchers should consider relaxing the assumptions and simplifications made in the study to develop a more comprehensive understanding of hyperparameter scaling laws.