AI & ML · Breaks Assumption

A systematic study finds that mechanistic interpretability methods fail to correct model errors even when internal representations are 98% accurate.

March 20, 2026

Original Paper

Interpretability without actionability: mechanistic methods cannot correct language model errors despite near-perfect internal representations

Sanjay Basu, Sadiq Y. Patel, Parth Sheth, Bhairavi Muralidharan, Namrata Elamaran, Aakriti Kinra, John Morgan, Rajaie Batniji

arXiv · 2603.18353

The Takeaway

The study exposes a substantial 'knowledge-action gap': being able to identify a feature in a model's latent space does not mean we can steer that feature to fix the model's errors. This challenges an assumption at the core of many AI safety frameworks, namely that interpretability enables effective error correction.

From the abstract

Language models encode task-relevant knowledge in internal representations that far exceeds their output performance, but whether mechanistic interpretability methods can bridge this knowledge-action gap has not been systematically tested. We compared four mechanistic interpretability methods -- concept bottleneck steering (Steerling-8B), sparse autoencoder feature steering, logit lens with activation patching, and linear probing with truthfulness separator vector steering (Qwen 2.5 7B Instruct) …