AI & ML · Breaks Assumption

Sparse Autoencoders (SAEs) fail at compositional generalization due to flawed dictionary learning, not the inference method.

March 31, 2026

Original Paper

Stop Probing, Start Coding: Why Linear Probes and Sparse Autoencoders Fail at Compositional Generalisation

Vitória Barin Pacela, Shruti Joshi, Isabela Camacho, Simon Lacoste-Julien, David Klindt

arXiv · 2603.28744

The Takeaway

The paper identifies a fundamental 'amortization gap' in SAEs and argues that current dictionary learning methods are the primary bottleneck in neural interpretability. This shifts the research focus from improving SAE encoders to the harder problem of scalable dictionary learning for sparse inference.
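To make the amortization gap concrete, here is a minimal numerical sketch, not the paper's setup: a fixed dictionary, sparse ground-truth codes, and superposed activations, with a one-shot SAE-style tied-weights encoder compared against per-sample iterative inference (ISTA). The dimensions, sparsity level, and penalty below are illustrative assumptions.

```python
# Sketch of the amortization gap under assumed toy dimensions.
import numpy as np

rng = np.random.default_rng(0)
d_concepts, d_act, n_samples, k_active = 96, 32, 500, 3

# Random unit-norm dictionary: columns are concept directions in activation space.
D = rng.normal(size=(d_act, d_concepts))
D /= np.linalg.norm(D, axis=0, keepdims=True)

# Sparse non-negative ground-truth codes: k_active concepts per sample.
Z = np.zeros((d_concepts, n_samples))
for j in range(n_samples):
    idx = rng.choice(d_concepts, size=k_active, replace=False)
    Z[idx, j] = rng.uniform(0.5, 1.5, size=k_active)
X = D @ Z  # superposed activations

def ista(x, D, lam=0.05, n_iter=500):
    """Per-sample iterative inference: min_z 0.5*||x - D z||^2 + lam*||z||_1, z >= 0."""
    step = 1.0 / np.linalg.norm(D, 2) ** 2   # 1 / Lipschitz constant of the gradient
    z = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ z - x)
        z = np.maximum(z - step * (grad + lam), 0.0)   # projected soft-threshold step
    return z

Z_amortized = np.maximum(D.T @ X, 0.0)   # one-shot, SAE-style tied-weights encoder
Z_iterative = np.stack([ista(X[:, j], D) for j in range(n_samples)], axis=1)

rel_err = lambda Z_hat: np.linalg.norm(Z_hat - Z) / np.linalg.norm(Z)
print("relative code error, amortized encoder  :", round(rel_err(Z_amortized), 3))
print("relative code error, iterative inference:", round(rel_err(Z_iterative), 3))
```

The one-shot encoder applies the same linear map to every activation, so interference between superposed concepts leaks into its codes; the iterative solver adapts to each sample and, in this toy setting, lands much closer to the ground-truth codes.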

From the abstract

The linear representation hypothesis states that neural network activations encode high-level concepts as linear mixtures. However, under superposition, this encoding is a projection from a higher-dimensional concept space into a lower-dimensional activation space, and a linear decision boundary in the concept space need not remain linear after projection. In this setting, classical sparse coding methods with per-sample iterative inference leverage compressed sensing guarantees to recover latent …
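A rough way to see this point, not the paper's experiments: with a known random dictionary, a label that is a simple linear threshold in concept space ("is concept 0 active?"), and scikit-learn's Lasso standing in for per-sample iterative inference, a linear probe on the superposed activations underperforms a linear classifier fit on the recovered sparse codes. The dictionary, dimensions, labelling rule, and penalty are all illustrative assumptions, and the dictionary is assumed known rather than learned.

```python
# Sketch: linear-in-concept-space labels vs. a linear probe on superposed activations.
import numpy as np
from sklearn.linear_model import Lasso, LogisticRegression

rng = np.random.default_rng(1)
d_concepts, d_act, n_samples, k_active = 96, 32, 1000, 8

# Random unit-norm dictionary A: the superposed projection from concepts to activations.
A = rng.normal(size=(d_act, d_concepts))
A /= np.linalg.norm(A, axis=0, keepdims=True)

# Sparse concept vectors; label is a linear rule in concept space ("concept 0 active?"),
# balanced by switching concept 0 on in a random half of the samples.
C = np.zeros((n_samples, d_concepts))
for i in range(n_samples):
    idx = rng.choice(np.arange(1, d_concepts), size=k_active, replace=False)
    C[i, idx] = rng.uniform(0.5, 1.5, size=k_active)
on = rng.choice(n_samples, size=n_samples // 2, replace=False)
C[on, 0] = rng.uniform(0.5, 1.5, size=n_samples // 2)
y = (C[:, 0] > 0).astype(int)
X = C @ A.T                        # superposed activations

train, test = slice(0, 800), slice(800, None)

# 1) Linear probe directly on activations.
probe = LogisticRegression(max_iter=5000).fit(X[train], y[train])

# 2) Per-sample sparse inference against the known dictionary, then a linear classifier.
def infer_codes(X):
    lasso = Lasso(alpha=0.005, positive=True, fit_intercept=False, max_iter=50000)
    return np.stack([lasso.fit(A, x).coef_.copy() for x in X])

Z_hat = infer_codes(X)
clf = LogisticRegression(max_iter=5000).fit(Z_hat[train], y[train])

print(f"linear probe on activations      : {probe.score(X[test], y[test]):.2f}")
print(f"linear classifier on sparse codes: {clf.score(Z_hat[test], y[test]):.2f}")
```

In this sketch the dictionary is handed to the sparse solver; in practice it has to be learned at scale, which is exactly the dictionary learning bottleneck the takeaway points to.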