Up to $91\%$ of the attention in translation models is sucked up by punctuation and language tags rather than actual words.
Most research on how AI models translate assumes the model's attention tracks the meaning of words. In reality, these attention sinks pull the bulk of the attention away from content tokens, which means many of our current tools for interpreting translation decisions are looking at the wrong positions. Previous alignment studies may have been misled by non-content markers that distort the model's internal map, and mitigating these sinks could lead to more efficient and more interpretable translation systems for low-resource languages.
Attention Sinks in Massively Multilingual Neural Machine Translation: Discovery, Analysis, and Mitigation
arXiv · 2605.01229
Cross-attention patterns in neural machine translation (NMT) are widely used to study how multilingual models align linguistic structure. We report a systematic artifact in cross-attention analysis of NLLB-200 (600M): non-content tokens (primarily end-of-sequence tokens, language tags, and punctuation) capture 83% to 91% of total cross-attention mass. We term these "attention sinks," extending findings from LLMs [Xiao et al., 2023] to NMT cross-attention and identifying a causal
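To make the measurement concrete, here is a minimal sketch (my own illustration, not the paper's released code) of how one could estimate the share of cross-attention mass landing on non-content tokens, using the public `facebook/nllb-200-distilled-600M` checkpoint via Hugging Face `transformers`. Classifying sinks as the language tag, `</s>`, and pure-punctuation subwords is an assumption based on the abstract's description, and the example language pair is arbitrary.

```python
# Sketch: fraction of cross-attention mass on non-content source tokens
# in NLLB-200 under a teacher-forced forward pass. Illustrative only.
import string
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

name = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(name, src_lang="eng_Latn", tgt_lang="fra_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(name).eval()

src = "The cat sat on the mat."
ref = "Le chat était assis sur le tapis."
batch = tokenizer(src, text_target=ref, return_tensors="pt")

with torch.no_grad():
    # Passing labels makes the model build decoder inputs itself;
    # output_attentions=True returns per-layer cross-attention maps.
    out = model(**batch, output_attentions=True)

# Flag source positions that are special tokens (language tag, </s>)
# or pure ASCII punctuation -- an approximation of "non-content".
src_tokens = tokenizer.convert_ids_to_tokens(batch["input_ids"][0])
special_ids = set(tokenizer.all_special_ids)

def is_sink(tok_id: int, tok: str) -> bool:
    clean = tok.lstrip("▁")  # strip the SentencePiece word-boundary marker
    return tok_id in special_ids or (
        clean != "" and all(c in string.punctuation for c in clean)
    )

sink_mask = torch.tensor(
    [is_sink(i.item(), t) for i, t in zip(batch["input_ids"][0], src_tokens)]
)

# cross_attentions: one (batch, heads, tgt_len, src_len) tensor per decoder layer.
attn = torch.stack(out.cross_attentions)  # (layers, batch, heads, tgt, src)
sink_share = attn[..., sink_mask].sum() / attn.sum()
print(f"share of cross-attention mass on non-content tokens: {sink_share:.1%}")
```

Aggregating this ratio over many sentences, layers, and heads is how one would arrive at a figure like the 83% to 91% range the abstract reports; the exact number will depend on the language pair and on which layers and heads are pooled.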