SeriesFusion
Science, curated & edited by AI
Paradigm Challenge  /  AI

A slow optimization trick from the 1950s actually works on billion-parameter AI models because of a mathematical loophole.

Zeroth-order optimization scales to massive models because its error depends on the dimension of the model's output, not on the number of parameters. This resolves the paradox of why a method that, in theory, should be far too slow for modern AI nonetheless works in practice. Most experts assumed that only gradient-based methods could handle the complexity of LLMs, but this research suggests otherwise. It opens the door to training AI on hardware that cannot compute gradients, such as analog or biological systems, and could lead to a new generation of hyper-efficient training techniques.
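To make the idea concrete, here is a minimal sketch of the core trick behind zeroth-order optimization: estimating a gradient from loss values alone, with no backpropagation. This is the classic two-point (SPSA-style) estimator, not the paper's specific method; all names and hyperparameters below are illustrative.

```python
import numpy as np

def zo_gradient(f, theta, eps=1e-3, rng=None):
    """Two-point zeroth-order gradient estimate:
    perturb theta along one random direction u and use the
    finite difference of two loss evaluations -- no gradients needed."""
    rng = np.random.default_rng() if rng is None else rng
    u = rng.standard_normal(theta.shape)
    # Directional derivative of f along u, approximated by finite differences.
    scale = (f(theta + eps * u) - f(theta - eps * u)) / (2 * eps)
    return scale * u  # unbiased estimate of the true gradient (in expectation)

# Toy quadratic loss with minimum at theta = [1, 2, 3].
target = np.array([1.0, 2.0, 3.0])
loss = lambda th: 0.5 * np.sum((th - target) ** 2)

theta = np.zeros(3)
rng = np.random.default_rng(0)
for _ in range(2000):
    theta -= 0.05 * zo_gradient(loss, theta, rng=rng)
```

Each step costs only two forward evaluations of the loss, which is why the method runs on hardware that cannot compute gradients; the classical worry is that the estimator's variance grows with the parameter dimension, the slowdown the paper argues does not bite in practice for LLM fine-tuning.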

Original Paper

Learning Dynamics of Zeroth-Order Optimization: A Kernel Perspective

Zhe Li, Bicheng Ying, Zidong Liu, Haibo Yang

arXiv  ·  2605.03373

Classical optimization theory establishes that zeroth-order (ZO) algorithms suffer from a dimension-dependent slowdown, with convergence rates typically scaling with the model dimension relative to first-order methods. Contrary to these theoretical expectations, however, a growing body of recent work demonstrates the successful application of ZO methods to fine-tuning Large Language Models (LLMs) with billions of parameters. To explain this paradox, we derive the one-step learning dynamics of