AI & ML Paradigm Challenge

The 'magic' of Transformers might just be a two-century-old statistical algorithm running inside a neural network.

April 16, 2026

Original Paper

Ordinary Least Squares is a Special Case of Transformer

Xiaojun Tan, Yuchen Zhao

arXiv · 2604.13656


The Takeaway

This paper provides an algebraic proof that Ordinary Least Squares (OLS) regression is a special case of a single-layer Linear Transformer: with the right parameter setting, one attention pass reproduces the OLS solution exactly. That reframes how we interpret what attention is doing. Rather than 'thinking,' a Transformer may be performing an efficient, closed-form statistical projection in a single pass, which would help explain why it is so good at in-context learning: it is literally solving the regression problem posed by its context. For researchers, this opens a path to more interpretable and mathematically grounded architectures, and it narrows the gap between 'mysterious' deep learning and 'transparent' classical statistics.
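To make the claim concrete, here is a minimal NumPy sketch of the idea, not the paper's actual construction: the parameter choice (a query projection carrying the inverse empirical covariance, identity keys, and the context labels as values) is an assumption for illustration, but it shows how a single softmax-free attention pass can return exactly the OLS prediction for a query point.

```python
import numpy as np

rng = np.random.default_rng(0)

# In-context regression task: n labelled pairs (x_i, y_i) plus one query x_q.
n, d = 32, 4
X = rng.normal(size=(n, d))                  # context inputs
beta_true = rng.normal(size=d)
y = X @ beta_true + 0.1 * rng.normal(size=n) # context labels
x_q = rng.normal(size=d)                     # query input

# Classical OLS prediction for the query: x_q^T (X^T X)^{-1} X^T y.
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
pred_ols = x_q @ beta_ols

# One pass of softmax-free (linear) attention over the context tokens:
#   prediction = sum_i y_i * <W_K x_i, W_Q x_q>
# Illustrative parameter setting: the inverse empirical covariance, built from
# its spectral decomposition, is folded into the query projection W_Q.
eigvals, eigvecs = np.linalg.eigh(X.T @ X)            # spectral decomposition
W_Q = eigvecs @ np.diag(1.0 / eigvals) @ eigvecs.T    # equals (X^T X)^{-1}
W_K = np.eye(d)                                       # keys: raw inputs
                                                      # values: the labels y_i

scores = (X @ W_K.T) @ (W_Q @ x_q)   # attention scores, shape (n,)
pred_attn = y @ scores               # single-pass attention output

print(np.allclose(pred_attn, pred_ols))   # True: identical predictions
```

The point of the sketch is that nothing iterative happens: the label-weighted attention scores already are the closed-form least-squares projection.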

From the abstract

The statistical essence of the Transformer architecture has long remained elusive: Is it a universal approximator, or a neural network version of known computational algorithms? Through rigorous algebraic proof, we show that the latter better describes Transformer's basic nature: Ordinary Least Squares (OLS) is a special case of the single-layer Linear Transformer. Using the spectral decomposition of the empirical covariance matrix, we construct a specific parameter setting where the attention […]
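For orientation, the spectral route the abstract alludes to presumably rests on the standard identity below; this is textbook linear algebra, not a reconstruction of the paper's proof. With the empirical covariance $\hat{\Sigma} = \tfrac{1}{n} X^\top X = V \Lambda V^\top$ (where $V$ is orthogonal and $\Lambda$ is diagonal), the OLS prediction for a query $x_q$ is

$$
\hat{y}_q = x_q^\top (X^\top X)^{-1} X^\top y = x_q^\top V \Lambda^{-1} V^\top \left( \tfrac{1}{n} X^\top y \right),
$$

a single closed-form projection of the context data, which is exactly the kind of one-pass computation a linear attention head can express.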