AI & ML Nature Is Weird

A new paper offers a mathematical 'smoking gun' for Transformers: their attention layers are effectively running Mirror Descent to learn from your prompt.

April 15, 2026

Original Paper

Transformers Learn Latent Mixture Models In-Context via Mirror Descent

arXiv · 2604.10848

The Takeaway

This paper proves that, in a specific setting, Transformers implement a classic optimization algorithm called Mirror Descent to perform in-context learning. Until now, the mechanism by which a Transformer 'reasons' through a prompt has been poorly understood. This work gives a rigorous mathematical account: inside its layers, the model carries out an optimization process that approximates a Bayes-optimal predictor. That moves us from 'vibe-based' intuition toward formal proofs of how LLMs learn from context, and it hands researchers a concrete lever for improving in-context learning: optimize the underlying Mirror Descent mechanics.
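To make the connection concrete, here is a minimal sketch of Mirror Descent with the negative-entropy mirror map over the probability simplex. In this classic case the update reduces to an exponentiated gradient followed by normalization, i.e. a softmax, which is the same operation attention applies to its scores. The toy objective below is my own illustration; the paper's exact setup and parametrization may differ.

```python
import numpy as np

def mirror_descent_simplex(grad_fn, w0, lr=1.0, steps=200):
    """Mirror descent on the simplex with a negative-entropy mirror map.

    The update w <- softmax(log w - lr * grad) is the classic
    'exponentiated gradient' form; the final softmax normalization is
    the same operation an attention layer applies to its scores.
    """
    w = np.asarray(w0, dtype=float)
    for _ in range(steps):
        logits = np.log(w) - lr * grad_fn(w)
        w = np.exp(logits - logits.max())  # numerically stable softmax
        w /= w.sum()
    return w

# Toy objective (illustrative only): recover a target weight vector by
# minimizing 0.5 * ||w - target||^2, whose gradient is w - target.
target = np.array([0.7, 0.2, 0.1])
w = mirror_descent_simplex(lambda w: w - target, np.ones(3) / 3)
```

After a few hundred multiplicative updates, `w` converges to `target` while staying on the simplex at every step, which is exactly the constraint-handling trick that makes the entropic mirror map a natural fit for attention weights.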

From the abstract

Sequence modelling requires determining which past tokens are causally relevant from the context and their importance: a process inherent to the attention layers in transformers, yet whose underlying learned mechanisms remain poorly understood. In this work, we formalize the task of estimating token importance as an in-context learning problem by introducing a framework based on Mixture of Transition Distributions, where a latent variable determines the influence of past tokens on the next. …
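For readers unfamiliar with the framework, here is a minimal sketch of a Mixture of Transition Distributions in its classic Raftery-style form, which the paper's setup builds on (the paper's exact latent-variable parametrization may differ). The lag weights play the role of "token importance": a latent choice of which past token drives the next one. All numbers below are illustrative.

```python
import numpy as np

# Classic MTD form:
#   p(x_t = v | x_{t-1}, ..., x_{t-L}) = sum_j lam[j] * Q[x_{t-j}, v]
# A shared transition matrix Q is applied to one past token, chosen
# (in expectation) according to the lag weights lam.

rng = np.random.default_rng(0)
V, L = 4, 3                            # vocabulary size, number of lags
Q = rng.dirichlet(np.ones(V), size=V)  # row-stochastic transition matrix
lam = np.array([0.6, 0.3, 0.1])        # importance of lags 1..L, sums to 1

def next_token_dist(history):
    """Mix the Q-rows indexed by the last L tokens, weighted by lam."""
    return sum(l * Q[history[-1 - j]] for j, l in enumerate(lam))

# History ..., x_{t-3}=2, x_{t-2}=0, x_{t-1}=1:
p = next_token_dist([2, 0, 1])
```

Because each `Q` row and the weights `lam` both sum to one, the mixture `p` is itself a valid distribution over the vocabulary; inferring `lam` from context is precisely the token-importance estimation problem the abstract describes.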