
Why Grokking Takes So Long: A First-Principles Theory of Representational Phase Transitions

arXiv:2603.13331v1. Abstract: Grokking is the sudden generalization that appears long after a model has perfectly memorized its training data. Although this phenomenon has been widely observed, there is still no quantitative theory explaining the length of the delay between memorization and generalization. Prior work has noted that weight decay plays an important role, but no result derives tight bounds for the delay or explains its scaling behavior. We present a first-principles theory showing that grokking arises from a norm-driven representational phase transition in regularized training dynamics. Training first converges to a high-norm memorization solution and only later contracts toward a lower-norm structured representation that generalizes. Our main result establishes a scaling law for the delay: T_grok - T_mem = Theta((1 / gamma_eff) * log(||theta_mem||^2 / ||theta_post||^2)), where gamma_eff is the effective contraction rate of the optimizer (gamma_eff = eta lambda for SGD and gamma_eff >= eta lambda for AdamW). The upper bound follows from a discrete Lyapunov contraction argument, and the matching lower bound arises from dynamical constraints of regularized first-order optimization. Across 293 training runs spanning modular addition, modular multiplication, and sparse parity tasks, we confirm three predictions: inverse scaling with weight decay, inverse scaling with learning rate, and logarithmic dependence on the norm ratio (R^2 > 0.97). We further find that grokking requires an optimizer that can decouple memorization from contraction: SGD fails under hyperparameters where AdamW reliably groks. These results show that grokking is a predictable consequence of norm separation between competing interpolating representations and provide the first quantitative scaling law for the delay of grokking.

Executive Summary

This article presents a first-principles theory of grokking in deep learning models: the sudden generalization that appears long after a model has perfectly memorized its training data. The authors propose that grokking arises from a norm-driven representational phase transition in regularized training dynamics and derive a scaling law for the delay between memorization and generalization. The theory is validated across 293 training runs on modular addition, modular multiplication, and sparse parity tasks, confirming inverse scaling with weight decay, inverse scaling with learning rate, and logarithmic dependence on the norm ratio (R^2 > 0.97). The results have significant implications for the design of efficient training algorithms and for the understanding of delayed generalization in deep learning.

Key Points

  • Grokking is a sudden generalization that occurs long after a model has perfectly memorized its training data.
  • A first-principles theory attributes grokking to a norm-driven representational phase transition in regularized training dynamics.
  • The delay obeys a scaling law, T_grok - T_mem = Theta((1 / gamma_eff) * log(||theta_mem||^2 / ||theta_post||^2)), with gamma_eff = eta * lambda for SGD and gamma_eff >= eta * lambda for AdamW.
  • 293 training runs confirm inverse scaling with weight decay, inverse scaling with learning rate, and logarithmic dependence on the norm ratio (R^2 > 0.97).
  • Grokking requires an optimizer that decouples memorization from contraction: SGD fails under hyperparameters where AdamW reliably groks.
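The scaling law above can be sanity-checked numerically. A minimal sketch (not the paper's code) that assumes pure multiplicative weight decay, theta <- (1 - eta*lambda) * theta per step, and treats the constant hidden by Theta as 1:

```python
import math

def predicted_delay(eta, lam, norm_mem_sq, norm_post_sq):
    """Delay from the scaling law T_grok - T_mem ~ (1/gamma_eff) * log(norm ratio),
    with gamma_eff = eta * lam for SGD and an illustrative constant of 1."""
    gamma_eff = eta * lam
    return (1.0 / gamma_eff) * math.log(norm_mem_sq / norm_post_sq)

def simulated_delay(eta, lam, norm_mem_sq, norm_post_sq):
    """Steps for the squared norm to contract from the memorization level to the
    post-grok level under pure weight decay: theta <- (1 - eta*lam) * theta."""
    norm_sq = norm_mem_sq
    steps = 0
    while norm_sq > norm_post_sq:
        norm_sq *= (1.0 - eta * lam) ** 2  # squared norm shrinks by (1 - eta*lam)^2
        steps += 1
    return steps
```

Under this toy contraction model the simulated delay comes out at roughly half the predicted value (the factor of 2 from squaring the norm is absorbed by Theta), and doubling the weight decay halves the delay, matching the paper's inverse-scaling prediction.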

Merits

Strength of the Theoretical Framework

The authors provide a comprehensive and well-motivated theoretical framework for understanding grokking, which is grounded in the principles of regularized training dynamics and norm-driven representational phase transitions.

Empirical Validation

The predictions are tested across 293 training runs on modular addition, modular multiplication, and sparse parity tasks, with the observed delays matching the predicted scaling in weight decay, learning rate, and norm ratio at R^2 > 0.97.

Insights into Deep Learning Phenomena

The theory provides valuable insights into the mechanisms underlying deep learning phenomena, such as the role of weight decay and learning rate in controlling the delay of grokking.
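The optimizer dependence reported above (SGD fails to grok under hyperparameters where AdamW succeeds) hinges on decoupled weight decay. A minimal scalar sketch of the two update rules, using the standard SGD-with-L2 and AdamW (Loshchilov and Hutter) formulas for illustration; this is not the paper's implementation:

```python
import math

def sgd_l2_step(theta, grad, eta, lam):
    """SGD with coupled L2 regularization: the decay term lam*theta is
    folded into the gradient before the step."""
    return theta - eta * (grad + lam * theta)

def adamw_step(theta, grad, m, v, t, eta, lam,
               b1=0.9, b2=0.999, eps=1e-8):
    """AdamW: the weight-decay term eta*lam*theta is applied directly to the
    parameter, decoupled from the adaptive (normalized) gradient step."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad * grad
    m_hat = m / (1 - b1 ** t)  # bias-corrected first moment
    v_hat = v / (1 - b2 ** t)  # bias-corrected second moment
    theta = theta - eta * m_hat / (math.sqrt(v_hat) + eps) - eta * lam * theta
    return theta, m, v
```

With a zero gradient both rules contract theta by (1 - eta*lam) per step, the gamma_eff = eta*lambda rate in the scaling law. With a nonzero gradient, AdamW's normalized step can hold the memorizing solution while the decoupled decay keeps contracting the norm, which is one way to read the paper's claim that grokking requires decoupling memorization from contraction.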

Demerits

Limited Scope of the Theory

The theory is primarily focused on the phenomenon of grokking and may not capture other deep learning phenomena, such as overfitting or underfitting.

Complexity of the Mathematical Derivations

The mathematical derivations are complex and may be challenging for non-experts to follow, which may limit the accessibility of the theory to a broader audience.

Expert Commentary

The article makes a significant contribution by turning grokking from an empirical curiosity into a quantitatively predicted phenomenon, pairing a derived scaling law with validation across 293 runs on three algorithmic tasks. The most notable empirical finding is the optimizer dependence: SGD fails to grok under hyperparameters where AdamW reliably does, suggesting that decoupled weight decay acts as a mechanism rather than an implementation detail. The main caveats are scope, since the theory addresses delayed generalization specifically rather than broader phenomena such as overfitting or underfitting, and accessibility, since the Lyapunov contraction arguments demand a mathematical background that may limit the audience.

Recommendations

  • The theory should be further developed and extended to capture other deep learning phenomena, such as overfitting and underfitting.
  • The mathematical derivations should be simplified and made more accessible to a broader audience.

Sources