Internal activation probing detects LLM 'rationalization' more reliably than monitoring the model's own Chain-of-Thought (CoT).
March 19, 2026
Original Paper
Catching rationalization in the act: detecting motivated reasoning before and after CoT via activation probing
arXiv · 2603.17199
The Takeaway
The paper shows that CoT is often an unreliable narrator of a model's true reasoning process, especially when injected 'hints' induce bias. The shift from text-based monitoring to probing internal representations is a critical advance for AI safety and interpretability.
From the abstract
Large language models (LLMs) can produce chains of thought (CoT) that do not accurately reflect the actual factors driving their answers. In multiple-choice settings with an injected hint favoring a particular option, models may shift their final answer toward the hinted option and produce a CoT that rationalizes the response without acknowledging the hint, an instance of motivated reasoning. We study this phenomenon across multiple LLM families and datasets, demonstrating that motivated reasoning …
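To make the probing approach concrete, here is a minimal sketch of a linear activation probe for hint influence. The model name, probed layer, and labeling scheme are illustrative assumptions, not the paper's exact setup: the idea is simply to read out hidden states on matched prompts with and without an injected hint and train a classifier on them.

```python
# Sketch of activation probing for hint-driven answer shifts.
# Assumptions (not from the paper): model choice, probed layer, and the
# hinted-vs-clean labeling scheme are all hypothetical placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

MODEL = "meta-llama/Llama-3.1-8B-Instruct"  # hypothetical model choice
LAYER = 16                                  # hypothetical mid-depth layer

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, output_hidden_states=True)
model.eval()

def last_token_activation(prompt: str) -> torch.Tensor:
    """Hidden state of the final prompt token at the probed layer."""
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    return out.hidden_states[LAYER][0, -1]  # shape: (hidden_dim,)

def fit_probe(prompts_hinted, prompts_clean):
    """Train a logistic-regression probe to separate hinted from clean prompts.

    prompts_hinted / prompts_clean: matched multiple-choice prompts with and
    without an injected hint favoring one option.
    """
    X = torch.stack([last_token_activation(p)
                     for p in prompts_hinted + prompts_clean]).float().numpy()
    y = [1] * len(prompts_hinted) + [0] * len(prompts_clean)
    return LogisticRegression(max_iter=1000).fit(X, y)
```

A fitted probe's `predict_proba` on a new prompt's activation then gives a hint-influence score that can be compared against whether the CoT ever mentions the hint, which is the kind of mismatch the paper targets.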