AI & ML Nature Is Weird

A wave of compression travels through an AI's "brain" like a physical ripple during training.

April 29, 2026

Original Paper

The Spectral Lifecycle of Transformer Training: Transient Compression Waves, Persistent Spectral Gradients, and the Q/K–V Asymmetry

arXiv · 2604.22778

The Takeaway

Training a transformer is often viewed as a chaotic adjustment of billions of numbers at once. This study finds that rank compression instead moves through the model as a predictable traveling wave, from the first layers to the last. The researchers also found that the Query and Key matrices drive this movement, while the Value matrices remain largely static. This asymmetry suggests that different parts of the attention mechanism play very different roles in learning, and understanding these physics-like dynamics could lead to faster, more stable training methods.

From the abstract

We present the first systematic study of weight matrix singular value spectra during transformer pretraining, tracking full SVD decompositions of every weight matrix at 25-step intervals across three model scales (30M–285M parameters). We discover three phenomena: (1) Transient Compression Waves: stable rank compression propagates as a traveling wave from early to late layers, creating a dramatic gradient that peaks early then reverses — late layers eventually over-compress […]
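The quantity the abstract tracks, stable rank, can be computed directly from a matrix's singular values. As a minimal sketch (assuming the common definition, stable rank = squared Frobenius norm divided by squared spectral norm; the paper's exact measurement code is not shown here):

```python
import numpy as np

def stable_rank(W: np.ndarray) -> float:
    """Stable (numerical) rank: sum of squared singular values over the largest one squared.

    Always between 1 and rank(W); a low value means the spectrum is dominated
    by a few directions, i.e. the matrix is effectively "compressed".
    """
    s = np.linalg.svd(W, compute_uv=False)  # singular values, descending
    return float(np.sum(s**2) / s[0]**2)

rng = np.random.default_rng(0)
W_full = rng.standard_normal((512, 512))                                # near-full-rank Gaussian
W_low = rng.standard_normal((512, 8)) @ rng.standard_normal((8, 512))   # exactly rank 8

print(stable_rank(W_full))  # large: spectrum spread across many directions
print(stable_rank(W_low))   # at most 8: spectrum concentrated
```

Tracking this scalar for every Q, K, and V projection matrix over training checkpoints is enough to visualize the layer-by-layer "wave" the paper describes: compression shows up as stable rank dropping in early layers first, then progressively later ones.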