Those 'buggy' high-value outlier tokens in Vision Transformers are actually the model's internal 'scratchpads.'
April 16, 2026
Original Paper
Understanding Outlier-tokens in Vision Transformers: The Scratchpad Hypothesis and an Outlier-Window Attention Mechanism
SSRN · 6581885
The Takeaway
For years, engineers treated outlier tokens with massive norms as numerical errors or noise to be suppressed. This paper shows they are functional: the model learns to use these outliers as 'scratchpads' for storing and redistributing global information across the image. Remove them, and the model's ability to understand the scene collapses. This finding changes how we should prune and quantize vision models: you can't simply clip the outliers away. The paper also introduces the Outlier-Window Attention Mechanism to better accommodate this emergent behavior, turning a perceived bug into a critical feature for building better computer vision systems.
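To make the idea concrete, here is a minimal sketch of how such outlier tokens could be flagged in practice, assuming you have extracted the (N, D) token embeddings from one ViT layer. The `ratio` threshold and the function name are illustrative choices, not the paper's method; the paper's own detection criterion may differ.

```python
import numpy as np

def find_outlier_tokens(tokens, ratio=5.0):
    """Flag tokens whose L2 norm exceeds `ratio` times the median token norm.

    tokens: (N, D) array of token embeddings from one ViT layer.
    Returns a boolean mask of shape (N,), True for suspected outlier tokens.
    """
    norms = np.linalg.norm(tokens, axis=1)
    return norms > ratio * np.median(norms)

# Synthetic demo: 8 ordinary tokens plus 2 artificially high-norm tokens,
# standing in for the 'scratchpad' tokens the paper identifies.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(10, 16))
tokens[3] *= 50.0
tokens[7] *= 50.0
mask = find_outlier_tokens(tokens)
print(mask.nonzero()[0])  # indices of detected outlier tokens → [3 7]
```

The takeaway for deployment is that a quantization or pruning pass should treat the tokens this mask selects as load-bearing, rather than clipping their values back into the typical range.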
From the abstract
Vision Transformers (ViTs) have achieved strong performance across a wide range of computer vision tasks, motivating closer examination of their internal mechanisms. Prior studies have identified a special class of ViT tokens with abnormally high L2 norms, referred to as Outlier-tokens, which are often viewed as undesirable anomalies. However, we find that removing these Outlier-tokens leads to severe degradation in visual representation capability, suggesting that they are not redundant noise b