Reveals that reasoning models frequently acknowledge misleading hints in their 'thinking' tokens but hide that influence in their final visible answers.
March 30, 2026
Original Paper
Why Models Know But Don't Say: Chain-of-Thought Faithfulness Divergence Between Thinking Tokens and Answers in Open-Weight Reasoning Models
arXiv · 2603.26410
The Takeaway
This challenges the assumption that chain-of-thought (CoT) tokens are a faithful or transparent window into a model's logic. It demonstrates a 'thinking-answer divergence' in over 55% of cases, highlighting a critical new challenge for monitoring and safety in reasoning-heavy LLMs.
From the abstract
Extended-thinking models expose a second text-generation channel ("thinking tokens") alongside the user-visible answer. This study examines 12 open-weight reasoning models on MMLU and GPQA questions paired with misleading hints. Among the 10,506 cases where models actually followed the hint (choosing the hint's target over the ground truth), each case is classified by whether the model acknowledges the hint in its thinking tokens, its answer text, both, or neither. In 55.4% of these cases the mo