Human concepts are not just straight lines in an AI's brain, and current interpretability tools are failing to capture their true shape.
Sparse autoencoders (SAEs) are the primary tool for understanding what is happening inside a language model. This research shows that these tools often dilute or mix up concepts rather than isolating them. Meaning appears to be organized along low-dimensional manifolds that are too curved for a single linear direction, or a simple linear probe, to capture. That challenges the industry assumption that identifying features in models is a solved problem. Truly seeing how an AI represents the world will require more sophisticated geometric tools: a new kind of non-linear psychology.
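To make the "too curved for a linear probe" claim concrete, here is a toy sketch. It is our illustration, not an experiment from the paper: a one-dimensional concept (an angle) embedded on a circle in a two-dimensional activation space. A single linear readout recovers only part of the angle's variance, while a decoder that uses the full manifold recovers it almost exactly.

```python
import numpy as np

# Toy illustration (not from the paper): a concept that lives on a circle.
# Each "activation" encodes an angle theta as (cos, sin) plus noise,
# mimicking a 1-D concept manifold embedded in activation space.
rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, size=2000)
acts = np.stack([np.cos(theta), np.sin(theta)], axis=1)
acts += 0.05 * rng.standard_normal(acts.shape)

# A single linear probe (plus bias) tries to predict theta directly.
X = np.hstack([acts, np.ones((len(acts), 1))])  # add bias column
w, *_ = np.linalg.lstsq(X, theta, rcond=None)
r2_linear = 1 - np.var(theta - X @ w) / np.var(theta)

# Decoding from the full 2-D manifold (atan2) unwraps the circle:
pred = np.arctan2(acts[:, 1], acts[:, 0]) % (2 * np.pi)
err = np.abs(((pred - theta + np.pi) % (2 * np.pi)) - np.pi)

print(f"linear probe R^2 on theta: {r2_linear:.2f}")          # ~0.6, far from 1
print(f"manifold decoder median error: {np.median(err):.3f} rad")  # near zero
```

No linear combination of the two coordinates can unwrap the circle, so the probe saturates well below perfect recovery; the manifold-aware decoder is limited only by the noise.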
Do Sparse Autoencoders Capture Concept Manifolds?
arXiv · 2604.28119
Sparse autoencoders (SAEs) are widely used to extract interpretable features from neural network representations, often under the implicit assumption that concepts correspond to independent linear directions. However, a growing body of evidence suggests that many concepts are instead organized along low-dimensional manifolds encoding continuous geometric relationships. This raises three basic questions: what does it mean for an SAE to capture a manifold, when do existing SAE architectures do so,
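For reference, here is a minimal sketch of the standard SAE setup the abstract describes, assuming a PyTorch-style implementation; the hyperparameters and names are placeholders, not values from the paper. Each learned feature corresponds to one decoder column, i.e. a single linear direction in activation space, which is exactly the assumption the paper interrogates.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Standard SAE: each learned feature is one decoder column,
    i.e. a single linear direction in activation space."""
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, x):
        f = torch.relu(self.encoder(x))  # sparse, non-negative feature codes
        x_hat = self.decoder(f)          # reconstruction as a sum of directions
        return x_hat, f

# Training objective: reconstruct activations while keeping codes sparse.
# l1_coeff is a placeholder hyperparameter, not a value from the paper.
def sae_loss(x, x_hat, f, l1_coeff=1e-3):
    recon = (x - x_hat).pow(2).sum(dim=-1).mean()
    sparsity = f.abs().sum(dim=-1).mean()
    return recon + l1_coeff * sparsity
```

A concept spread along a curved manifold has no single direction for a decoder column to latch onto, so an SAE trained this way must either tile the manifold with many discrete features or smear the concept across several, which is the failure mode the paper examines.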