If you change just one tiny ingredient in an AI’s training, you can break the whole thing without a single warning light going off.
April 3, 2026
Original Paper
Does Your Optimizer Care How You Normalize? Normalization-Optimizer Coupling in LLM Training
arXiv · 2604.01563
The Takeaway
Researchers found that a normalization layer and an optimizer that each work fine on their own can clash when combined, sharply degrading performance. This means many "failed" AI experiments might just be hidden compatibility issues that no one noticed.
From the abstract
In LLM training, normalization layers and optimizers are typically treated as independent design choices. In a 3x2 factorial at 1B parameters and 1000 training steps, we show this assumption can fail: Dynamic Erf (Derf; Chen & Liu, 2025) suffers a large negative interaction with Muon (Jordan, 2024), with its gap to RMSNorm growing from +0.31 nats under AdamW to +0.97 under Muon, approximately three times larger. Dynamic Tanh (DyT; Zhu et al., 2025), included as a bounded-normalizer control, show …
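To make the comparison concrete, here is a minimal elementwise sketch of the two bounded normalizers the abstract names. DyT is commonly described as a scaled tanh with a learnable steepness; the exact parameterization of Derf is an assumption here (erf swapped in for tanh), and the scalar `alpha`, `gamma`, `beta` stand in for what would be learnable per-channel parameters in a real layer.

```python
import math

def dyt(x, alpha=1.0, gamma=1.0, beta=0.0):
    # Dynamic Tanh (DyT): squash the input elementwise with tanh(alpha * x),
    # then apply an affine scale/shift. alpha, gamma, beta are learnable
    # in practice; plain scalars here for illustration.
    return gamma * math.tanh(alpha * x) + beta

def derf(x, alpha=1.0, gamma=1.0, beta=0.0):
    # Dynamic Erf (Derf), sketched by analogy: same shape, but the error
    # function is the bounded squasher. This parameterization is an
    # assumption, not taken from the paper.
    return gamma * math.erf(alpha * x) + beta
```

Both functions are bounded (output magnitude at most `gamma` plus `|beta|`), which is why the paper can treat DyT as a bounded-normalizer control for Derf: any difference in how they interact with Muon versus AdamW is then down to the shape of the squashing curve, not boundedness itself.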