Reveals that many 'polysemantic' neurons in LLMs are actually firing for shared word forms (lexical) rather than compressed semantic concepts.
April 2, 2026
Original Paper
Polysemanticity or Polysemy? Lexical Identity Confounds Superposition Metrics
arXiv · 2604.00443
The Takeaway
A critical finding for mechanistic interpretability: the paper shows that 18-36% of Sparse Autoencoder (SAE) features blend word senses due to lexical confounds. Filtering these features improves performance on knowledge editing and word sense disambiguation.
From the abstract
If the same neuron activates for both "lender" and "riverside," standard metrics attribute the overlap to superposition: the neuron must be compressing two unrelated concepts. This work explores how much of the overlap is instead due to a lexical confound: neurons fire for a shared word form (such as "bank") rather than for two compressed concepts. A 2x2 factorial decomposition reveals that the lexical-only condition (same word, different meaning) consistently exceeds the semantic-only condition (different word, same meaning).
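The 2x2 design can be illustrated with a toy simulation. This is a hypothetical sketch, not the paper's code: activation vectors are synthetic Gaussians in which a word's representation is the sum of a word-form (lexical) component and a meaning component, with the lexical component deliberately weighted more heavily to mirror the reported finding. All names and magnitudes here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 256  # toy activation dimensionality

def cosine(a, b):
    """Cosine similarity between two activation vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Illustrative components (assumption: lexical identity contributes more
# strongly to the activation than meaning, per the paper's finding).
LEX_SCALE = 2.0
form_bank, form_lender = rng.normal(size=d), rng.normal(size=d)
meaning_finance, meaning_river = rng.normal(size=d), rng.normal(size=d)

def activation(word_form, meaning):
    # Activation = weighted word-form component + meaning component.
    return LEX_SCALE * word_form + meaning

# The four cells of the 2x2 factorial design
# (same/different word form x same/different meaning).
pairs = {
    "same word, same meaning": (
        activation(form_bank, meaning_finance),
        activation(form_bank, meaning_finance),
    ),
    "same word, diff meaning (lexical-only)": (
        activation(form_bank, meaning_finance),   # "bank" = lender
        activation(form_bank, meaning_river),     # "bank" = riverside
    ),
    "diff word, same meaning (semantic-only)": (
        activation(form_bank, meaning_finance),
        activation(form_lender, meaning_finance),
    ),
    "diff word, diff meaning": (
        activation(form_bank, meaning_finance),
        activation(form_lender, meaning_river),
    ),
}

results = {name: cosine(a, b) for name, (a, b) in pairs.items()}
for name, sim in results.items():
    print(f"{name}: {sim:.2f}")
```

With the lexical component weighted at 2x, the lexical-only overlap exceeds the semantic-only overlap, reproducing the qualitative pattern the decomposition is designed to detect; a naive superposition metric would misread that lexical-only overlap as concept compression.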