Internal activation probing detects LLM 'rationalization' more reliably than monitoring the model's own Chain-of-Thought (CoT).
March 19, 2026
Original Paper
Catching rationalization in the act: detecting motivated reasoning before and after CoT via activation probing
arXiv · 2603.17199
The Takeaway
The paper shows that CoT is often an unreliable narrator of a model's true reasoning process, especially when injected 'hints' induce bias. The shift from text-based monitoring to probing internal representations is a critical advance for AI safety and interpretability.
From the abstract
Large language models (LLMs) can produce chains of thought (CoT) that do not accurately reflect the actual factors driving their answers. In multiple-choice settings with an injected hint favoring a particular option, models may shift their final answer toward the hinted option and produce a CoT that rationalizes the response without acknowledging the hint, an instance of motivated reasoning. We study this phenomenon across multiple LLM families and datasets, demonstrating that motivated reasoning …
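To make the probing approach concrete, here is a minimal sketch of a linear activation probe for hint influence. The model name, probed layer, and labeling scheme are illustrative assumptions, not the paper's exact setup: the idea is simply to read out hidden states on matched prompts with and without an injected hint and train a classifier on them.

```python
# Sketch of activation probing for hint-driven answer shifts.
# Assumptions (not from the paper): model choice, probed layer, and the
# hinted-vs-clean labeling scheme are all hypothetical placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

MODEL = "meta-llama/Llama-3.1-8B-Instruct"  # hypothetical model choice
LAYER = 16                                  # hypothetical mid-depth layer

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, output_hidden_states=True)
model.eval()

def last_token_activation(prompt: str) -> torch.Tensor:
    """Hidden state of the final prompt token at the probed layer."""
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    return out.hidden_states[LAYER][0, -1]  # shape: (hidden_dim,)

def fit_probe(prompts_hinted, prompts_clean):
    """Train a logistic-regression probe to separate hinted from clean prompts.

    prompts_hinted / prompts_clean: matched multiple-choice prompts with and
    without an injected hint favoring one option.
    """
    X = torch.stack([last_token_activation(p)
                     for p in prompts_hinted + prompts_clean]).float().numpy()
    y = [1] * len(prompts_hinted) + [0] * len(prompts_clean)
    return LogisticRegression(max_iter=1000).fit(X, y)
```

A fitted probe's `predict_proba` on a new prompt's activation then gives a hint-influence score that can be compared against whether the CoT ever mentions the hint, which is the kind of mismatch the paper targets.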