Those 'buggy' high-value outlier tokens in Vision Transformers are actually the model's internal 'scratchpads.'
April 16, 2026
Original Paper
Understanding Outlier-tokens in Vision Transformers: The Scratchpad Hypothesis and an Outlier-Window Attention Mechanism
SSRN · 6581885
The Takeaway
For years, engineers treated outlier tokens with massive norms as numerical errors or noise to be suppressed. This paper shows they are functional: the model learns to use these outliers as 'scratchpads' for storing and redistributing global information across the image. Remove them, and the model's ability to understand the scene collapses. This finding changes how we should prune and quantize vision models: you can't simply clip the outliers away. The paper also introduces the Outlier-Window Attention Mechanism to better accommodate this emergent behavior, turning a perceived bug into a critical feature for building better computer vision systems.
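To make the idea concrete, here is a minimal sketch of how such outlier tokens could be flagged in practice, assuming you have extracted the (N, D) token embeddings from one ViT layer. The `ratio` threshold and the function name are illustrative choices, not the paper's method; the paper's own detection criterion may differ.

```python
import numpy as np

def find_outlier_tokens(tokens, ratio=5.0):
    """Flag tokens whose L2 norm exceeds `ratio` times the median token norm.

    tokens: (N, D) array of token embeddings from one ViT layer.
    Returns a boolean mask of shape (N,), True for suspected outlier tokens.
    """
    norms = np.linalg.norm(tokens, axis=1)
    return norms > ratio * np.median(norms)

# Synthetic demo: 8 ordinary tokens plus 2 artificially high-norm tokens,
# standing in for the 'scratchpad' tokens the paper identifies.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(10, 16))
tokens[3] *= 50.0
tokens[7] *= 50.0
mask = find_outlier_tokens(tokens)
print(mask.nonzero()[0])  # indices of detected outlier tokens → [3 7]
```

The takeaway for deployment is that a quantization or pruning pass should treat the tokens this mask selects as load-bearing, rather than clipping their values back into the typical range.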
From the abstract
Vision Transformers (ViTs) have achieved strong performance across a wide range of computer vision tasks, motivating closer examination of their internal mechanisms. Prior studies have identified a special class of ViT tokens with abnormally high L2 norms, referred to as Outlier-tokens, which are often viewed as undesirable anomalies. However, we find that removing these Outlier-tokens leads to severe degradation in visual representation capability, suggesting that they are not redundant noise b