This paper demonstrates that Sparse Autoencoder (SAE) features in multimodal models are not modular, challenging a core assumption behind intervention-based steering methods.
March 27, 2026
Original Paper
Sparse Visual Thought Circuits in Vision-Language Models
arXiv · 2603.25075
The Takeaway
The paper shows that intervening on multiple 'thought' features simultaneously causes output drift and accuracy degradation. This finding is critical for researchers working on model interpretability and steering: it suggests SAE features cannot be treated as independent LEGO blocks that compose cleanly.
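To make the intervention at issue concrete, here is a minimal toy sketch of SAE feature steering: encode activations into sparse latents, amplify a chosen feature set, and decode back. All names, shapes, and weights are illustrative assumptions (real SAEs also carry encoder/decoder biases, omitted here); this is not the paper's code or API.

```python
import torch

def steer_with_sae(acts, w_enc, w_dec, feature_ids, scale=2.0):
    # Encode activations into sparse latents, amplify the chosen
    # features, then decode back into the residual stream.
    # (Real SAEs also include bias terms, omitted for brevity.)
    z = torch.relu(acts @ w_enc)                  # sparse feature activations
    z[:, feature_ids] = z[:, feature_ids] * scale # intervene on the chosen set
    return z @ w_dec                              # steered activations

# Toy shapes, purely for illustration: 4 tokens, model dim 16, dict size 64.
acts = torch.randn(4, 16)
w_enc, w_dec = torch.randn(16, 64), torch.randn(64, 16)
steered = steer_with_sae(acts, w_enc, w_dec, feature_ids=[3, 7, 42])
```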
From the abstract
Sparse autoencoders (SAEs) improve interpretability in multimodal models, but it remains unclear whether SAE features form modular, composable units for reasoning, an assumption underlying many intervention-based steering methods. We test this modularity hypothesis and find it often fails: intervening on a task-selective feature set can modestly improve reasoning accuracy, while intervening on the union of two such sets reliably induces output drift (large unintended changes in predictions) and degrades accuracy.
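As a heavily simplified illustration of the modularity test the abstract describes, the sketch below steers with feature set A, set B, and their union, and measures 'output drift' as the fraction of predictions that change relative to an unsteered baseline. Everything here, from the random weights to the 10-way readout head, is a stand-in assumption; it mirrors the experimental logic, not the paper's implementation, and random weights will not reproduce the paper's numbers.

```python
import torch

torch.manual_seed(0)
d_model, d_sae, n_tokens = 16, 64, 256           # toy dimensions
w_enc, w_dec = torch.randn(d_model, d_sae), torch.randn(d_sae, d_model)
head = torch.randn(d_model, 10)                  # toy 10-way readout head
acts = torch.randn(n_tokens, d_model)            # toy activations

def steer(a, feats, scale=2.0):
    # Same intervention as above: scale selected SAE latents, decode back.
    z = torch.relu(a @ w_enc)
    z[:, feats] = z[:, feats] * scale
    return z @ w_dec

def predict(a):
    return (a @ head).argmax(dim=-1)

baseline = predict(acts)
set_a, set_b = [1, 5, 9], [12, 20, 33]           # two 'task-selective' sets
for name, feats in [("A", set_a), ("B", set_b), ("A+B", set_a + set_b)]:
    # Output drift: fraction of predictions changed by the intervention.
    drift = (predict(steer(acts, feats)) != baseline).float().mean().item()
    print(f"steering set {name}: output drift = {drift:.2f}")
```

The paper's claim, in these terms, is that drift under A+B is reliably larger than under A or B alone, i.e. the two feature sets interact rather than compose.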