Up to $91\%$ of the attention in translation models is sucked up by punctuation and language tags rather than actual words.
Most research on how AI models translate assumes the model's attention tracks the meaning of words. In reality, these attention sinks pull the bulk of the attention away from content tokens, which means many of our current tools for interpreting translation decisions are looking at the wrong positions. Previous alignment studies may have been misled by non-content markers that distort the model's internal map, and mitigating these sinks could lead to more efficient and more interpretable translation systems for low-resource languages.
Attention Sinks in Massively Multilingual Neural Machine Translation: Discovery, Analysis, and Mitigation
arXiv · 2605.01229
Cross-attention patterns in neural machine translation (NMT) are widely used to study how multilingual models align linguistic structure. We report a systematic artifact in cross-attention analysis of NLLB-200 (600M): non-content tokens (primarily end-of-sequence tokens, language tags, and punctuation) capture 83% to 91% of total cross-attention mass. We term these "attention sinks," extending findings from LLMs [Xiao et al., 2023] to NMT cross-attention and identifying a causal
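To make the measurement concrete, here is a minimal sketch (my own illustration, not the paper's released code) of how one could estimate the share of cross-attention mass landing on non-content tokens, using the public `facebook/nllb-200-distilled-600M` checkpoint via Hugging Face `transformers`. Classifying sinks as the language tag, `</s>`, and pure-punctuation subwords is an assumption based on the abstract's description, and the example language pair is arbitrary.

```python
# Sketch: fraction of cross-attention mass on non-content source tokens
# in NLLB-200 under a teacher-forced forward pass. Illustrative only.
import string
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

name = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(name, src_lang="eng_Latn", tgt_lang="fra_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(name).eval()

src = "The cat sat on the mat."
ref = "Le chat était assis sur le tapis."
batch = tokenizer(src, text_target=ref, return_tensors="pt")

with torch.no_grad():
    # Passing labels makes the model build decoder inputs itself;
    # output_attentions=True returns per-layer cross-attention maps.
    out = model(**batch, output_attentions=True)

# Flag source positions that are special tokens (language tag, </s>)
# or pure ASCII punctuation -- an approximation of "non-content".
src_tokens = tokenizer.convert_ids_to_tokens(batch["input_ids"][0])
special_ids = set(tokenizer.all_special_ids)

def is_sink(tok_id: int, tok: str) -> bool:
    clean = tok.lstrip("▁")  # strip the SentencePiece word-boundary marker
    return tok_id in special_ids or (
        clean != "" and all(c in string.punctuation for c in clean)
    )

sink_mask = torch.tensor(
    [is_sink(i.item(), t) for i, t in zip(batch["input_ids"][0], src_tokens)]
)

# cross_attentions: one (batch, heads, tgt_len, src_len) tensor per decoder layer.
attn = torch.stack(out.cross_attentions)  # (layers, batch, heads, tgt, src)
sink_share = attn[..., sink_mask].sum() / attn.sum()
print(f"share of cross-attention mass on non-content tokens: {sink_share:.1%}")
```

Aggregating this ratio over many sentences, layers, and heads is how one would arrive at a figure like the 83% to 91% range the abstract reports; the exact number will depend on the language pair and on which layers and heads are pooled.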