Extreme neural network sparsification causes a catastrophic interpretability collapse even when global accuracy remains stable.
March 20, 2026
Original Paper
Fundamental Limits of Neural Network Sparsification: Evidence from Catastrophic Interpretability Collapse
arXiv · 2603.18056
The Takeaway
The paper identifies a fundamental limit in the trade-off between model compression and mechanistic interpretability: at high sparsity levels, local features vanish as their neurons go 'dead'. This finding is critical for researchers working on Sparse Autoencoders (SAEs) or model pruning, since it suggests that 'interpretable' and 'efficient' representations may be fundamentally at odds at high compression ratios.
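As a rough illustration of the dead-neuron effect (not from the paper), a top-k sparse autoencoder makes it easy to measure: latent features that the top-k selection never picks contribute nothing to any reconstruction and cannot be interpreted. The sketch below, with hypothetical names and illustrative dimensions, estimates that fraction:

```python
# Hypothetical sketch: estimating the fraction of "dead" SAE features,
# i.e. latents that never fire over a batch of inputs. The top-k encoder
# and all dimensions are illustrative assumptions, not from the paper.
import torch

def dead_feature_fraction(encoder: torch.nn.Linear, x: torch.Tensor, k: int = 50) -> float:
    """Fraction of latent features that never fire under a top-k sparsity constraint."""
    acts = torch.relu(encoder(x))                     # (batch, n_features) pre-sparsity activations
    topk = torch.topk(acts, k, dim=-1)                # keep only the k largest activations per input
    mask = torch.zeros_like(acts).scatter_(-1, topk.indices, 1.0)
    fired = (acts * mask > 0).any(dim=0)              # did each feature fire at least once?
    return 1.0 - fired.float().mean().item()

# With 2000 candidate features but only k=50 active per input, many
# features may never be selected and therefore read as "dead".
enc = torch.nn.Linear(512, 2000)
x = torch.randn(4096, 512)
print(f"dead fraction: {dead_feature_fraction(enc, x, k=50):.2%}")
```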
From the abstract
Extreme neural network sparsification (90% activation reduction) presents a critical challenge for mechanistic interpretability: understanding whether interpretable features survive aggressive compression. This work investigates feature survival under severe capacity constraints in hybrid Variational Autoencoder-Sparse Autoencoder (VAE-SAE) architectures. We introduce an adaptive sparsity scheduling framework that progressively reduces active neurons from 500 to 50 over 50 training epochs, and …
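The excerpt doesn't spell out the schedule's functional form; as a minimal sketch, a linear decay from 500 to 50 active neurons over 50 epochs could look like the following (the function name and the linear interpolation are assumptions, not the paper's method):

```python
# Minimal sketch of an adaptive sparsity schedule in the spirit of the
# abstract: the number of active neurons k decays from 500 to 50 over
# 50 epochs. The linear interpolation is an assumption; the paper's
# exact schedule is not given in this excerpt.

def active_neurons(epoch: int, k_start: int = 500, k_end: int = 50, n_epochs: int = 50) -> int:
    """Number of neurons kept active at a given epoch (linear decay)."""
    frac = min(epoch, n_epochs) / n_epochs
    return round(k_start + frac * (k_end - k_start))

for epoch in (0, 10, 25, 50):
    print(epoch, active_neurons(epoch))  # 500, 410, 275, 50
```

A schedule like this would typically set k for a top-k activation at the start of each epoch, so the capacity constraint tightens gradually rather than all at once.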