AI & ML Breaks Assumption

Softmax normalization mathematically mandates attention sinks: fixed positions that act as 'null states' when a model needs to ignore its input.

March 13, 2026

Original Paper

Attention Sinks Are Provably Necessary in Softmax Transformers: Evidence from Trigger-Conditional Tasks

Yuval Ran-Milo

arXiv · 2603.11487

The Takeaway

This paper provides the first formal proof that attention sinks (the concentration of attention on fixed tokens) are a structural necessity of softmax self-attention. It demonstrates that moving to non-normalized ReLU attention eliminates sinks, offering a clear path for designing more efficient long-context architectures without artificial 'anchor' tokens.
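The intuition is easy to see numerically. This is an illustrative sketch, not the paper's formal construction: because softmax weights must sum to 1, a query that matches no content token is forced to dump its probability mass somewhere, typically a fixed sink position, whereas unnormalized ReLU attention can simply output all zeros.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax: weights are positive and sum to 1.
    e = np.exp(x - x.max())
    return e / e.sum()

# Attention scores for one query over [sink position, content tokens...].
# The query "matches" nothing: every content score is strongly negative.
scores = np.array([0.0, -10.0, -10.0, -10.0])

# Softmax has no way to express "attend to nothing": the simplex
# constraint pushes nearly all mass onto the stable anchor at index 0.
w_softmax = softmax(scores)

# ReLU attention is not normalized, so a true null state exists:
# all weights can be exactly zero, and no sink token is needed.
w_relu = np.maximum(scores, 0.0)

print(w_softmax)  # nearly all mass at index 0 (the sink)
print(w_relu)     # all zeros: input ignored without an anchor
```

Here index 0 plays the role of the content-agnostic sink position described in the abstract; in a real transformer this is often the beginning-of-sequence token.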

From the abstract

Transformers often display an attention sink: probability mass concentrates on a fixed, content-agnostic position. We prove that computing a simple trigger-conditional behavior necessarily induces a sink in softmax self-attention models. Our results formalize a familiar intuition: normalization over a probability simplex must force attention to collapse onto a stable anchor to realize a default state (e.g., when the model needs to ignore the input). We instantiate this with a concrete task: when …