SeriesFusion
Science, curated & edited by AI

Human concepts are not just straight lines in an AI's internal representations, and current interpretability tools fail to capture their true shape.

Sparse autoencoders (SAEs) are the primary tool for understanding what happens inside a language model. This research shows that these tools often dilute or mix concepts rather than isolating them: meaning appears to be organized along low-dimensional manifolds too curved for simple linear probes to capture. This challenges the industry assumption that the problem of identifying features in models is solved. We need more sophisticated geometric tools to see how AI actually represents the world. Understanding AI will require a new kind of non-linear psychology.
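To see why a curved concept resists a single linear feature, here is a toy sketch (not the paper's method, and independent of any real SAE implementation): a one-dimensional concept, an angle, embedded as a circle in a 16-dimensional activation space. Any single linear direction, loosely analogous to one SAE decoder column, can explain at most half of the manifold's variance; the full plane spanned by two directions is needed.

```python
import numpy as np

# Hypothetical setup: a 1-D concept (angle theta) traced out as a
# circle inside a 16-dimensional "activation" space.
rng = np.random.default_rng(0)
d = 16

# Two random orthonormal directions spanning the plane of the circle.
u, _ = np.linalg.qr(rng.normal(size=(d, 2)))
theta = np.linspace(0, 2 * np.pi, 200, endpoint=False)
acts = np.cos(theta)[:, None] * u[:, 0] + np.sin(theta)[:, None] * u[:, 1]

# Covariance of the activations; its eigenvalues tell us how much
# variance any single linear direction can capture.
cov = acts.T @ acts / len(theta)
eigvals = np.linalg.eigvalsh(cov)[::-1]  # sorted descending
total = eigvals.sum()

print(f"one linear direction:  {eigvals[0] / total:.2f} of variance")
print(f"two linear directions: {eigvals[:2].sum() / total:.2f} of variance")
```

The circle's variance is split evenly between two eigen-directions, so the best single linear "feature" tops out at 50%: the concept is one-dimensional, but not one-directional.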

Original Paper

Do Sparse Autoencoders Capture Concept Manifolds?

Usha Bhalla, Thomas Fel, Can Rager, Sheridan Feucht, Tal Haklay, Daniel Wurgaft, Siddharth Boppana, Matthew Kowal, Vasudev Shyam, Jack Merullo, Atticus Geiger, Ekdeep Singh Lubana

arXiv  ·  2604.28119

Sparse autoencoders (SAEs) are widely used to extract interpretable features from neural network representations, often under the implicit assumption that concepts correspond to independent linear directions. However, a growing body of evidence suggests that many concepts are instead organized along low-dimensional manifolds encoding continuous geometric relationships. This raises three basic questions: what does it mean for an SAE to capture a manifold, when do existing SAE architectures do so,