AI & ML Efficiency Breakthrough

Low-precision optimizer states suffer from 'state staleness', where updates round back to the stored value, but scheduled resets can fully recover the lost performance.

March 18, 2026

Original Paper

Understanding Quantization of Optimizer States in LLM Pre-training: Dynamics of State Staleness and Effectiveness of State Resets

Kristi Topollai, Anna Choromanska

arXiv · 2603.16731

The Takeaway

The paper provides a mechanistic explanation for why 4-bit and 8-bit optimizers often lag behind their full-precision counterparts. By applying a theory-guided reset schedule, practitioners can match full-precision performance at a significantly lower memory footprint, enabling larger-scale pre-training on existing hardware.
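The paper's exact reset schedule is not detailed in this summary; as a rough sketch under assumed parameters, a reset can be implemented by periodically overwriting the quantized EMA state with a freshly computed full-precision value (the decay, reset period, quantizer step size, and the use of a full-precision shadow variable are all illustrative choices here, not the authors' method — a real implementation would avoid persistently storing the shadow, which would negate the memory savings):

```python
import numpy as np

def quantize(x, step=0.25):
    # Coarse uniform round-to-nearest quantizer, standing in for a
    # low-bit optimizer-state format (step size is illustrative).
    return np.round(x / step) * step

beta, reset_every = 0.9, 5   # illustrative EMA decay and reset period
q_state = quantize(1.0)      # low-precision stored EMA state
shadow = 1.0                 # full-precision EMA, used only at resets

for t in range(1, 21):
    grad = 0.01              # small, constant gradient signal
    shadow = beta * shadow + (1 - beta) * grad   # exact EMA
    if t % reset_every == 0:
        # Scheduled reset: replace the stale quantized state with the
        # quantization of the current full-precision value.
        q_state = quantize(shadow)
    else:
        # Ordinary quantized update, which may round back to q_state.
        q_state = quantize(beta * q_state + (1 - beta) * grad)

print(q_state)
```

Without the resets, the quantized state in this toy setup stays frozen at its initial value; with them, it tracks the decaying full-precision EMA down through the quantization grid.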

From the abstract

Quantizing optimizer states is becoming an important ingredient of memory-efficient large-scale pre-training, but the resulting optimizer dynamics remain only partially understood. We study low-precision exponential moving average (EMA) optimizer states and show how quantization can cause many nominal updates to round back to the same stored value, making the state effectively stale and slowing adaptation beyond what the nominal decay would suggest. We then develop a simple predictive model of s[…]
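The rounding-back mechanism described above is easy to reproduce in a toy setting. The sketch below (all parameters are illustrative assumptions, not taken from the paper) stores an EMA state on a coarse uniform grid: with a small gradient, every nominal update lands closer to the stored value than to any neighboring grid point, so the state never moves even though the full-precision EMA would decay:

```python
import numpy as np

def quantize(x, step=0.25):
    # Uniform round-to-nearest quantizer with a coarse grid, standing
    # in for a low-bit optimizer-state format (illustrative step size).
    return np.round(x / step) * step

beta = 0.9                       # EMA decay for the optimizer state
state = quantize(1.0)            # quantized stored state
exact = 1.0                      # full-precision EMA, for comparison
stale_steps = 0

for _ in range(20):
    grad = 0.01                  # small gradient: the nominal update is tiny
    nominal = beta * state + (1 - beta) * grad
    new_state = quantize(nominal)
    if new_state == state:       # update rounded back to the stored value
        stale_steps += 1
    state = new_state
    exact = beta * exact + (1 - beta) * grad

print(stale_steps, state)
```

Here every one of the 20 updates rounds back, so the stored state stays pinned at 1.0 while the exact EMA decays toward the gradient value — the state is stale far beyond what the nominal decay of 0.9 per step would suggest.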