An AI often knows the correct answer in its head even when it writes out a completely wrong explanation.
April 29, 2026
Original Paper
When Chain-of-Thought Fails, the Solution Hides in the Hidden States
arXiv · 2604.23351
The Takeaway
Chain-of-Thought reasoning is supposed to show the model's inner logic, but it can be surprisingly misleading. This research shows that the correct answer is frequently present in the model's internal hidden states even when its written output is wrong. The model, in effect, knows the answer but fails to translate that internal signal into its text response. That means a model's written reasoning cannot be trusted as a faithful reflection of what it knows. Probing the internal states directly, as sketched below, could be a more reliable way to get the truth out of a model than just reading what it says.
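One concrete way to read that internal signal is a linear probe: train a simple classifier on hidden-state vectors to predict whether the model's written answer is correct. The sketch below is a minimal illustration, not the paper's method; the hidden-state matrix, layer choice, and labels are all placeholder assumptions.

```python
# Minimal linear-probe sketch (NOT the paper's method): train a classifier on
# per-question hidden-state vectors to predict whether the model's written
# answer was correct. All inputs below are placeholders/assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(1000, 768))  # assumed: one vector per question,
                                              # taken from a fixed layer/token
labels = rng.integers(0, 2, size=1000)        # assumed: 1 = written answer correct

X_tr, X_te, y_tr, y_te = train_test_split(
    hidden_states, labels, test_size=0.2, random_state=0
)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"probe accuracy: {probe.score(X_te, y_te):.3f}")  # ~0.5 on random data
```

On real activations, probe accuracy above the model's own written-answer accuracy would be evidence that the hidden states carry information the text output fails to express.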
From the abstract
Whether intermediate reasoning is computationally useful or merely explanatory depends on whether chain-of-thought (CoT) tokens contain task-relevant information. We present a mechanistic causal analysis of CoT on GSM8K using activation patching: transferring token-level hidden states from a CoT generation to a direct-answer run for the same question, then measuring the effect on final-answer accuracy. Across models, generating after patching yields substantially higher accuracy than both direct …
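For readers who want to see the mechanics, here is a rough sketch of token-level activation patching in PyTorch with Hugging Face transformers. The model (GPT-2), layer index, token position, and prompts are illustrative assumptions, not the paper's setup; the point is only to show how a hidden state from a CoT run can be spliced into a direct-answer run.

```python
# Activation-patching sketch (illustrative assumptions, not the paper's setup):
# capture a hidden state from a CoT prompt, then splice it into a direct-answer
# run via a forward hook and read off the next-token prediction.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

LAYER = 6  # assumed layer to patch (hypothetical choice)

def get_hidden(prompt: str) -> torch.Tensor:
    """Run the source (CoT) prompt; keep the last token's state after LAYER."""
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, output_hidden_states=True)
    # hidden_states[LAYER + 1] is the output of block LAYER; take the last token.
    return out.hidden_states[LAYER + 1][0, -1]

def patched_logits(prompt: str, donor: torch.Tensor) -> torch.Tensor:
    """Run the direct-answer prompt with the donor state spliced in at LAYER."""
    def hook(module, inputs, output):
        hs = output[0].clone()
        hs[0, -1] = donor           # overwrite the last token's hidden state
        return (hs,) + output[1:]
    handle = model.transformer.h[LAYER].register_forward_hook(hook)
    try:
        ids = tok(prompt, return_tensors="pt").input_ids
        with torch.no_grad():
            out = model(ids)
    finally:
        handle.remove()
    return out.logits[0, -1]        # next-token distribution after patching

# Usage: transfer the state from a CoT run into a direct-answer run.
donor = get_hidden("Q: 17 + 25 = ? Let's think step by step. 17 + 25 = 42. A:")
logits = patched_logits("Q: 17 + 25 = ? A:", donor)
print(tok.decode(logits.argmax()))
```

The paper's protocol then compares accuracy after generating from the patched run against direct answering, per the abstract; this sketch shows only the splice itself, at a single token position rather than across the full sequence.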