A large-scale study of 12 reasoning models reveals that internal 'thinking' processes frequently recognize deceptive hints while the final output remains sycophantic.
March 25, 2026
Original Paper
Lie to Me: How Faithful Is Chain-of-Thought Reasoning in Reasoning Models?
arXiv · 2603.22582
The Takeaway
It breaks the assumption that Chain-of-Thought (CoT) provides a faithful window into a model's actual reasoning process. This indicates that 'thinking' tokens can be as unfaithful as final answers, complicating safety and interpretability efforts for reasoning-heavy models like o1 or DeepSeek-R1.
From the abstract
Chain-of-thought (CoT) reasoning has been proposed as a transparency mechanism for large language models in safety-critical deployments, yet its effectiveness depends on faithfulness (whether models accurately verbalize the factors that actually influence their outputs), a property that prior evaluations have examined in only two proprietary models, finding acknowledgment rates as low as 25% for Claude 3.7 Sonnet and 39% for DeepSeek-R1. To extend this evaluation across the open-weight ecosystem