AI & ML Scaling Insight

Provides a high-dimensional theoretical foundation for why two-phase optimizers like DiLoCo can outperform standard SGD in specific noise regimes.

March 31, 2026

Original Paper

High dimensional theory of two-phase optimizers

Atish Agarwala

arXiv · 2603.26954

The Takeaway

DiLoCo is becoming a standard for training LLMs over slow networks and decentralized clusters. This paper shows that the local-then-global synchronization strategy isn't just an engineering workaround: it creates a beneficial signal-to-noise tradeoff and an acceleration effect from 'stacked' momentum, which together can improve training quality over traditional fully synchronous methods.
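
To make the two-phase structure concrete, here is a minimal NumPy sketch of one DiLoCo-style round. The function names, the plain-SGD inner loop, and the heavy-ball outer update are illustrative assumptions for exposition, not the paper's exact formulation of LA-DiLoCo.

```python
import numpy as np

def diloco_round(params, worker_grad, n_workers, inner_steps,
                 inner_lr, outer_lr, outer_momentum, velocity):
    """One DiLoCo-style round (illustrative): local SGD on each
    worker, then a momentum step on the averaged pseudo-gradient."""
    local = [params.copy() for _ in range(n_workers)]

    # Phase 1: each worker optimizes locally, with no communication.
    for w in range(n_workers):
        for _ in range(inner_steps):
            local[w] -= inner_lr * worker_grad(w, local[w])

    # Phase 2: synchronize. The average displacement acts as an
    # "outer gradient"; feeding it through a momentum buffer is the
    # source of the 'stacked' momentum effect described above.
    delta = params - np.mean(local, axis=0)
    velocity = outer_momentum * velocity + delta
    params = params - outer_lr * velocity
    return params, velocity
```

The inner learning rate controls how far workers drift before synchronizing, while the outer momentum buffer compounds information across rounds on top of whatever the inner optimizer does, which is the 'stacking' the takeaway refers to.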

From the abstract

The trend towards larger training setups has brought a renewed interest in partially asynchronous two-phase optimizers which optimize locally and then synchronize across workers. Additionally, recent work suggests that the one-worker version of one of these algorithms, DiLoCo, shows promising results as a (synchronous) optimizer. Motivated by these studies we present an analysis of LA-DiLoCo, a simple member of the DiLoCo family, on a high-dimensional linear regression problem. […]
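
For intuition about the setting the abstract names, a toy high-dimensional linear regression on which one could run the round sketched above (reusing `diloco_round`) might look like the following. The dimensions, noise level, and hyperparameters are arbitrary choices for illustration, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_samples = 500, 2000                 # high-dimensional toy problem
X = rng.normal(size=(n_samples, d))
w_star = rng.normal(size=d) / np.sqrt(d)
y = X @ w_star + 0.1 * rng.normal(size=n_samples)  # noisy labels

def worker_grad(worker, params, batch=64):
    """Minibatch least-squares gradient for one simulated worker."""
    idx = rng.integers(0, n_samples, size=batch)
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ params - yb) / batch

params, velocity = np.zeros(d), np.zeros(d)
for _ in range(100):                     # 100 outer rounds
    params, velocity = diloco_round(
        params, worker_grad, n_workers=4, inner_steps=20,
        inner_lr=0.05, outer_lr=0.7, outer_momentum=0.9,
        velocity=velocity)
print("train MSE:", np.mean((X @ params - y) ** 2))
```

In a setup like this, the label noise sets a floor on the achievable loss, which is what makes the signal-to-noise tradeoff between long local phases and frequent synchronization visible in the analysis.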