AI & ML Scaling Insight

A systematic study reveals that grokking is not an architectural property of Transformers but an interaction between weight decay and optimization stability.

March 27, 2026

Original Paper

A Systematic Empirical Study of Grokking: Depth, Architecture, Activation, and Regularization

Shalima Binta Manir, Anamika Paul Rupa

arXiv · 2603.25009

The Takeaway

The study disentangles the confounds in previous grokking research, showing that MLPs and Transformers behave similarly once hyperparameters are matched. It identifies weight decay as the 'Goldilocks' control parameter for generalization, yielding a clearer recipe for training models on modular tasks.
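To make that recipe concrete, here is a minimal sketch of a weight-decay sweep on modular addition, assuming a small PyTorch MLP trained with AdamW. The hidden width, learning rate, train fraction, and step count are illustrative choices, not the paper's settings.

```python
# Hypothetical sketch: sweep weight decay on modular addition (a + b) mod 97
# and watch for delayed generalization (grokking). Hyperparameters are
# illustrative, not taken from the paper.
import torch
import torch.nn as nn

P = 97                                  # modulus from the paper's task
torch.manual_seed(0)

# Full dataset: every (a, b) pair with label (a + b) mod P.
pairs = torch.cartesian_prod(torch.arange(P), torch.arange(P))
labels = (pairs[:, 0] + pairs[:, 1]) % P
perm = torch.randperm(P * P)
n_train = int(0.4 * P * P)              # illustrative train fraction
train_idx, val_idx = perm[:n_train], perm[n_train:]

# Concatenate one-hot encodings of the two operands as the model input.
X = torch.cat([nn.functional.one_hot(pairs[:, 0], P),
               nn.functional.one_hot(pairs[:, 1], P)], dim=1).float()
y = labels

def accuracy(model, idx):
    with torch.no_grad():
        return (model(X[idx]).argmax(dim=1) == y[idx]).float().mean().item()

for wd in [0.0, 1e-3, 1e-1, 1.0]:       # the weight-decay "Goldilocks" sweep
    model = nn.Sequential(nn.Linear(2 * P, 256), nn.ReLU(), nn.Linear(256, P))
    opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=wd)
    loss_fn = nn.CrossEntropyLoss()
    for step in range(20000):
        opt.zero_grad()
        loss = loss_fn(model(X[train_idx]), y[train_idx])
        loss.backward()
        opt.step()
        if step % 2000 == 0:
            print(f"wd={wd:g} step={step} "
                  f"train_acc={accuracy(model, train_idx):.2f} "
                  f"val_acc={accuracy(model, val_idx):.2f}")
```

The qualitative pattern one would expect from the paper's framing: with weight decay near zero the model memorizes the training split while validation accuracy stays low, a moderate value eventually produces the delayed jump in validation accuracy, and a very large value destabilizes training itself.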

From the abstract

Grokking, the delayed transition from memorization to generalization in neural networks, remains poorly understood, in part because prior empirical studies confound the roles of architecture, optimization, and regularization. We present a controlled study that systematically disentangles these factors on modular addition (mod 97), with matched and carefully tuned training regimes across models. Our central finding is that grokking dynamics are not primarily determined by architecture, but by the interaction between weight decay and optimization stability.
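One simple way to make the "delayed transition" quantitative is to measure the gap between the step at which training accuracy saturates and the step at which validation accuracy catches up. The helper below is an illustrative definition of that delay, not the paper's exact measurement, and assumes accuracy curves logged during training like those printed by the sweep above.

```python
# Illustrative metric: grokking delay = number of logged intervals between the
# point where training accuracy crosses a threshold (memorization) and the
# point where validation accuracy does (generalization). This definition is an
# assumption for illustration, not the paper's exact measurement.
from typing import Optional, Sequence

def first_crossing(acc: Sequence[float], threshold: float) -> Optional[int]:
    """Index of the first logged step where accuracy reaches the threshold."""
    for i, a in enumerate(acc):
        if a >= threshold:
            return i
    return None

def grokking_delay(train_acc: Sequence[float],
                   val_acc: Sequence[float],
                   threshold: float = 0.99) -> Optional[int]:
    """Logged intervals between memorization and generalization, or None if
    either curve never crosses the threshold."""
    t = first_crossing(train_acc, threshold)
    v = first_crossing(val_acc, threshold)
    if t is None or v is None:
        return None
    return v - t

# Toy example: training saturates early, validation catches up much later.
train_curve = [0.5, 0.9, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
val_curve   = [0.1, 0.1, 0.1, 0.2, 0.3, 0.7, 0.99, 1.0]
print(grokking_delay(train_curve, val_curve))  # -> 4 logged intervals
```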