Transformers, diffusion maps, and magnetic Laplacians turn out to be different regimes of the same underlying Markov geometry.
April 14, 2026
Original Paper
The Diffusion-Attention Connection
arXiv · 2604.09560
The Takeaway
The paper unifies Transformers, diffusion maps, and magnetic Laplacians into a single Markov geometry derived from pre-softmax query scores. Through this shared geometric lens, ideas developed for diffusion, such as stable sampling schedules, could plausibly carry over to Transformer attention mechanisms.
From the abstract
Transformers, diffusion-maps, and magnetic Laplacians are usually treated as separate tools; we show they are all different regimes of a single Markov geometry built from pre-softmax query-scores. We define a QK "bidivergence" whose exponentiated and normalized forms yield attention, diffusion-maps, and magnetic diffusion, and we use products of experts and Schrödinger bridges to connect and organize them into equilibrium, nonequilibrium steady-state, and driven dynamics.
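To make the shared construction concrete, here is a minimal NumPy sketch (not from the paper) of how the same pre-softmax QK scores can be exponentiated and normalized in three ways: row normalization gives attention, a symmetric density-corrected normalization gives a diffusion-map operator, and keeping the antisymmetric part of the scores as a complex phase gives a magnetic-diffusion-style operator. The particular symmetrization, density correction, and phase construction are illustrative assumptions, not the paper's exact definitions of the bidivergence.

```python
import numpy as np

# Toy queries and keys; the construction below only assumes pre-softmax scores.
rng = np.random.default_rng(0)
n, d = 6, 4
Q = rng.normal(size=(n, d))
K = rng.normal(size=(n, d))

# Pre-softmax query scores (standard scaled dot-product).
S = Q @ K.T / np.sqrt(d)

# 1) Attention regime: exponentiate and row-normalize (softmax over keys).
A = np.exp(S - S.max(axis=1, keepdims=True))
attention = A / A.sum(axis=1, keepdims=True)           # row-stochastic Markov matrix

# 2) Diffusion-map regime: symmetrize the scores into a kernel, apply a
#    density correction W -> D^{-1} W D^{-1}, then row-normalize (assumption).
W = np.exp((S + S.T) / 2)
deg = W.sum(axis=1)
W_corr = W / np.outer(deg, deg)
P_diff = W_corr / W_corr.sum(axis=1, keepdims=True)     # diffusion-map Markov operator

# 3) Magnetic regime: keep the antisymmetric part of the scores as a phase,
#    giving a Hermitian "magnetic" kernel (assumption for illustration).
phase = (S - S.T) / 2
W_mag = np.exp((S + S.T) / 2) * np.exp(1j * phase)
P_mag = W_mag / np.abs(W_mag).sum(axis=1, keepdims=True)

print(attention.sum(axis=1))   # each row sums to 1
print(P_diff.sum(axis=1))      # each row sums to 1
```

All three operators are built from the same score matrix S; only the normalization and the treatment of its symmetric versus antisymmetric parts change, which is the sense in which the abstract speaks of different regimes of one geometry.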