Making models larger actually makes them worse at ignoring irrelevant junk text.
April 16, 2026
Original Paper
Better and Worse with Scale: How Contextual Entrainment Diverges with Model Size
arXiv · 2604.13275
The Takeaway
The research reveals a scaling paradox: while bigger models get better at ignoring false claims in their context, they simultaneously become more prone to 'mindless copying' of irrelevant non-semantic tokens. This divergence in 'contextual entrainment' means larger models are more easily distracted by garbage in the prompt. The common assumption has been that scaling improves every aspect of handling context; this paper shows that 'smarter' models are uniquely vulnerable to being derailed by noisy context. For RAG practitioners, the implication is that better models may actually require *cleaner* context, not just more of it. It challenges the 'just throw it all in the prompt' philosophy of large context windows.
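To make 'contextual entrainment' concrete, here is a minimal sketch (not the paper's exact protocol) of how you might probe it with an off-the-shelf model: check how much an irrelevant distractor sentence in the prompt inflates the probability of a token it contains. The model choice, prompts, and scoring function below are illustrative assumptions.

```python
# Sketch only: probe contextual entrainment by comparing the probability a
# model assigns to an irrelevant token with and without a distractor in context.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/pythia-410m"  # smallest Pythia size named in the abstract
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def next_token_logprob(prompt: str, target: str) -> float:
    """Log-probability the model assigns to `target` as the next token after `prompt`."""
    target_id = tok(target, add_special_tokens=False)["input_ids"][0]
    input_ids = tok(prompt, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        logits = model(input_ids).logits[0, -1]
    return torch.log_softmax(logits, dim=-1)[target_id].item()

question = "The capital of France is"
distractor = "The word banana appears in this unrelated sentence. "

# One possible entrainment score: how much the distractor boosts the
# probability of the irrelevant token " banana" as the next token.
clean = next_token_logprob(question, " banana")
entrained = next_token_logprob(distractor + question, " banana")
print(f"log-prob shift toward ' banana': {entrained - clean:+.3f}")
```

Run over a family of model sizes, a probe like this is what lets you ask whether the shift grows or shrinks with scale.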
From the abstract
Larger language models become simultaneously better and worse at handling contextual information -- better at ignoring false claims, worse at ignoring irrelevant tokens. We formalize this apparent paradox through the first scaling laws for contextual entrainment, the tendency of models to favor tokens that appeared in context regardless of relevance. Analyzing the Cerebras-GPT (111M-13B) and Pythia (410M-12B) model families, we find entrainment follows predictable power-law scaling, but with opp…
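Since the abstract frames the result as a scaling law, here is a tiny sketch of the kind of fit that implies: entrainment as a power law in parameter count, estimated by a linear regression in log-log space. The data points are placeholders, not the paper's measurements.

```python
# Sketch of a power-law fit E(N) = a * N^b for entrainment vs. parameter count.
# The values below are illustrative placeholders, not results from the paper.
import numpy as np

# Approximate parameter counts spanning the Pythia range cited in the abstract.
params = np.array([410e6, 1.0e9, 2.8e9, 6.9e9, 12e9])
# Hypothetical entrainment scores at each size (e.g. measured with a probe
# like the one sketched above).
entrainment = np.array([0.12, 0.15, 0.19, 0.24, 0.28])

# Fit log E = b * log N + log a, i.e. a straight line in log-log space.
b, log_a = np.polyfit(np.log(params), np.log(entrainment), deg=1)
print(f"fitted power law: E(N) ~ {np.exp(log_a):.3g} * N^{b:.3f}")
```

A positive exponent b would mean entrainment grows with model size, which is the "worse with scale" half of the paper's divergence.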