This paper demonstrates that Sparse Autoencoder (SAE) features in multimodal models are not modular, challenging a core assumption behind intervention-based steering methods.
March 27, 2026
Original Paper
Sparse Visual Thought Circuits in Vision-Language Models
arXiv · 2603.25075
The Takeaway
The paper shows that intervening on multiple 'thought' features simultaneously causes output drift and accuracy degradation. This finding is critical for researchers working on model interpretability and steering: it suggests SAE features cannot be treated as independent LEGO blocks that compose cleanly.
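To make the intervention at issue concrete, here is a minimal toy sketch of SAE feature steering: encode activations into sparse latents, amplify a chosen feature set, and decode back. All names, shapes, and weights are illustrative assumptions (real SAEs also carry encoder/decoder biases, omitted here); this is not the paper's code or API.

```python
import torch

def steer_with_sae(acts, w_enc, w_dec, feature_ids, scale=2.0):
    # Encode activations into sparse latents, amplify the chosen
    # features, then decode back into the residual stream.
    # (Real SAEs also include bias terms, omitted for brevity.)
    z = torch.relu(acts @ w_enc)                  # sparse feature activations
    z[:, feature_ids] = z[:, feature_ids] * scale # intervene on the chosen set
    return z @ w_dec                              # steered activations

# Toy shapes, purely for illustration: 4 tokens, model dim 16, dict size 64.
acts = torch.randn(4, 16)
w_enc, w_dec = torch.randn(16, 64), torch.randn(64, 16)
steered = steer_with_sae(acts, w_enc, w_dec, feature_ids=[3, 7, 42])
```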
From the abstract
Sparse autoencoders (SAEs) improve interpretability in multimodal models, but it remains unclear whether SAE features form modular, composable units for reasoning, an assumption underlying many intervention-based steering methods. We test this modularity hypothesis and find it often fails: intervening on a task-selective feature set can modestly improve reasoning accuracy, while intervening on the union of two such sets reliably induces output drift (large unintended changes in predictions) and degrades accuracy.
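As a heavily simplified illustration of the modularity test the abstract describes, the sketch below steers with feature set A, set B, and their union, and measures 'output drift' as the fraction of predictions that change relative to an unsteered baseline. Everything here, from the random weights to the 10-way readout head, is a stand-in assumption; it mirrors the experimental logic, not the paper's implementation, and random weights will not reproduce the paper's numbers.

```python
import torch

torch.manual_seed(0)
d_model, d_sae, n_tokens = 16, 64, 256           # toy dimensions
w_enc, w_dec = torch.randn(d_model, d_sae), torch.randn(d_sae, d_model)
head = torch.randn(d_model, 10)                  # toy 10-way readout head
acts = torch.randn(n_tokens, d_model)            # toy activations

def steer(a, feats, scale=2.0):
    # Same intervention as above: scale selected SAE latents, decode back.
    z = torch.relu(a @ w_enc)
    z[:, feats] = z[:, feats] * scale
    return z @ w_dec

def predict(a):
    return (a @ head).argmax(dim=-1)

baseline = predict(acts)
set_a, set_b = [1, 5, 9], [12, 20, 33]           # two 'task-selective' sets
for name, feats in [("A", set_a), ("B", set_b), ("A+B", set_a + set_b)]:
    # Output drift: fraction of predictions changed by the intervention.
    drift = (predict(steer(acts, feats)) != baseline).float().mean().item()
    print(f"steering set {name}: output drift = {drift:.2f}")
```

The paper's claim, in these terms, is that drift under A+B is reliably larger than under A or B alone, i.e. the two feature sets interact rather than compose.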