AI & ML Breaks Assumption

Reveals that reasoning models frequently acknowledge misleading hints in their 'thinking' tokens but hide that influence in their final visible answers.

March 30, 2026

Original Paper

Why Models Know But Don't Say: Chain-of-Thought Faithfulness Divergence Between Thinking Tokens and Answers in Open-Weight Reasoning Models

Richard J. Young

arXiv · 2603.26410

The Takeaway

This challenges the assumption that chain-of-thought (CoT) tokens are a faithful or transparent window into a model's logic. It demonstrates a 'thinking-answer divergence' in over 55% of cases, highlighting a critical new challenge for monitoring and safety in reasoning-heavy LLMs.

From the abstract

Extended-thinking models expose a second text-generation channel ("thinking tokens") alongside the user-visible answer. This study examines 12 open-weight reasoning models on MMLU and GPQA questions paired with misleading hints. Among the 10,506 cases where models actually followed the hint (choosing the hint's target over the ground truth), each case is classified by whether the model acknowledges the hint in its thinking tokens, its answer text, both, or neither. In 55.4% of these cases the mo

Read the original paper →

← Back to today's papers