SeriesFusion
Science, curated & edited by AI

Parameter directions and learned features show a 100 to 330x coupling that standard optimizer paths completely mask.

AdamW trajectories often hide the most significant structural patterns in how a neural network actually learns. Standard optimization diagnostics smooth over the training process, making it look as though the weights move in a simple way. Measuring sensitivity along loss-gradient directions instead reveals hidden couplings of 100–330×, one to two orders of magnitude larger than the 3–9× measured along the optimizer's own updates. This suggests the field has been reading the wrong map of feature emergence. Future architecture designs could target these high-sensitivity directions to build more efficient models.

Original Paper

Gradient-Direction Sensitivity Reveals Linear-Centroid Coupling Hidden by Optimizer Trajectories

Yongzhong Xu

arXiv  ·  2604.25143

We show that replacing the rolling SVD of AdamW updates with a rolling SVD of loss gradients changes the diagnostic by 1–2 orders of magnitude. Performing SVD on the loss gradient instead of the AdamW update increases the measured perturbative coupling between SED directions and Linear Centroid Hypothesis (LCH) features from $\bar{R}_k \approx 3$–$9\times$ to $100$–$330\times$ across four single-task modular arithmetic operations, eliminating the apparent operation dependence in the original diagnostic.
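
The sketch below illustrates the core contrast described in the abstract, not the paper's actual code: collect per-step loss gradients and per-step realized AdamW updates, take a rolling SVD of each stack, and compare how strongly the top singular directions align with a set of candidate feature directions. The toy model, random data, the `feature_dirs` placeholder for LCH features, and the mean absolute cosine similarity used in place of the paper's $\bar{R}_k$ are all assumptions for illustration.

```python
# Hedged sketch: SVD of loss gradients vs. SVD of AdamW updates,
# plus a simple alignment score against candidate feature directions.
# Toy model and random targets; this shows the computation, not the result.
import torch

torch.manual_seed(0)

model = torch.nn.Linear(64, 64)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)

# Hypothetical stand-ins for LCH feature directions (unit vectors in weight space).
n_params = 64 * 64 + 64
feature_dirs = torch.nn.functional.normalize(torch.randn(8, n_params), dim=1)

grad_rows, update_rows = [], []
for step in range(32):
    x = torch.randn(128, 64)
    y = torch.randn(128, 64)
    loss = torch.nn.functional.mse_loss(model(x), y)

    opt.zero_grad()
    before = torch.cat([p.detach().flatten().clone() for p in model.parameters()])
    loss.backward()
    # Flattened loss gradient at this step.
    grad_rows.append(torch.cat([p.grad.flatten().clone() for p in model.parameters()]))
    opt.step()
    after = torch.cat([p.detach().flatten().clone() for p in model.parameters()])
    # Realized AdamW update at this step.
    update_rows.append(after - before)

def top_dirs(rows, k=4):
    """Top-k right singular vectors of the stacked per-step vectors."""
    M = torch.stack(rows)                      # (steps, n_params)
    _, _, Vh = torch.linalg.svd(M, full_matrices=False)
    return Vh[:k]                              # (k, n_params)

def mean_alignment(dirs, feats):
    """Mean |cosine similarity| between singular directions and feature directions
    (an illustrative proxy for the paper's coupling ratio, not its definition)."""
    return (dirs @ feats.T).abs().mean().item()

print("gradient-SVD alignment:     ", mean_alignment(top_dirs(grad_rows), feature_dirs))
print("AdamW-update-SVD alignment: ", mean_alignment(top_dirs(update_rows), feature_dirs))
```

On a trained modular-arithmetic model with the paper's actual SED and LCH constructions, the claim is that the gradient-based version of this diagnostic yields couplings of $100$–$330\times$ where the update-based version yields only $3$–$9\times$; the toy run above merely demonstrates where the two pipelines diverge.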