Pitfalls in Evaluating Interpretability Agents
arXiv · 2603.20101 · March 23, 2026
Exposes fundamental flaws in using LLM-based agents to evaluate automated interpretability and model circuits.
The Takeaway
Challenges the current trend of 'replication-based' evaluation for interpretability, showing that LLMs often guess or memorize known findings rather than genuinely rediscover them. It proposes a more robust 'functional interchangeability' metric that is unsupervised and harder to game.
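The core idea behind functional interchangeability can be illustrated with a toy sketch. This is not the paper's implementation; every name here (`interchangeability_score`, the surrogate lambdas) is a hypothetical illustration of the general pattern: swap a model component for an explanation-derived surrogate and measure how often downstream outputs agree. A faithful explanation yields a surrogate the model barely notices; a guessed or memorized one degrades agreement.

```python
import numpy as np

def interchangeability_score(component, surrogate, inputs, model_tail):
    """Fraction of inputs on which the model's output is unchanged
    when the true component is swapped for the surrogate."""
    original = np.array([model_tail(component(x)) for x in inputs])
    patched = np.array([model_tail(surrogate(x)) for x in inputs])
    return float(np.mean(original == patched))

# Toy "component": a feature detector that thresholds at 0.5.
component = lambda x: x > 0.5
# Surrogate derived from a faithful explanation of the component.
good_surrogate = lambda x: x >= 0.5
# Surrogate derived from a guessed/memorized explanation.
bad_surrogate = lambda x: x > 0.9
# Rest of the model downstream of the component.
model_tail = lambda h: int(h)

rng = np.random.default_rng(0)
xs = rng.random(1000)
good = interchangeability_score(component, good_surrogate, xs, model_tail)
bad = interchangeability_score(component, bad_surrogate, xs, model_tail)
print(good, bad)  # the faithful surrogate scores markedly higher
```

Note the metric needs no ground-truth labels for the explanation itself, only the model's own outputs, which is what makes it unsupervised and harder to game than checking whether an agent can restate a known finding.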
From the abstract
Automated interpretability systems aim to reduce the need for human labor and scale analysis to increasingly large models and diverse tasks. Recent efforts toward this goal leverage large language models (LLMs) at increasing levels of autonomy, ranging from fixed one-shot workflows to fully autonomous interpretability agents. This shift creates a corresponding need to scale evaluation approaches to keep pace with both the volume and complexity of generated explanations. We investigate this chall…