AI models can actually get 'brain fog' where their old thoughts clutter up their heads so much they forget how to think straight.
April 13, 2026
Original Paper
Robust Reasoning Benchmark
arXiv · 2604.08571
The Takeaway
This paper shows that intermediate reasoning steps can act like cognitive clutter that degrades performance over the course of a single conversation. It suggests that more 'thinking' isn't always better: it can actually confuse the model by filling its context window with distracting noise.
From the abstract
While Large Language Models (LLMs) achieve high performance on standard mathematical benchmarks, their underlying reasoning processes remain highly overfit to standard textual formatting. We propose a perturbation pipeline consisting of 14 techniques to evaluate the robustness of LLM reasoning. We apply this pipeline to the AIME 2024 dataset and evaluate 8 state-of-the-art models on the resulting benchmark. While frontier models exhibit resilience, open-weight reasoning models suffer catastrophic collapse.
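The paper does not list its 14 techniques here, but the general idea of answer-preserving textual perturbation can be sketched. The two perturbations below (random whitespace noise and an irrelevant distractor sentence) are illustrative assumptions, not the authors' actual pipeline:

```python
import random


def perturb_whitespace(problem: str, seed: int = 0) -> str:
    """Inject random extra spaces/line breaks; the math content is unchanged."""
    rng = random.Random(seed)
    out = []
    for word in problem.split():
        out.append(word)
        if rng.random() < 0.3:  # occasionally add stray whitespace
            out.append("\n" if rng.random() < 0.5 else " ")
    return " ".join(out)


def perturb_distractor(problem: str) -> str:
    """Prepend an irrelevant sentence that does not affect the answer."""
    return "Note: the weather in Paris was mild that day. " + problem


problem = "Find the remainder when 2^10 is divided by 7."
perturbed = perturb_distractor(perturb_whitespace(problem))
print(perturbed)
```

A robust model should give the same answer to `problem` and `perturbed`; the benchmark measures how far accuracy drops when it does not.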