Mechanistic interpretability reveals that LLMs possess 'affect reception' circuits that detect emotional content even when explicit keywords are removed.
March 25, 2026
Original Paper
Whether, Not Which: Mechanistic Interpretability Reveals Dissociable Affect Reception and Emotion Categorization in LLMs
arXiv · 2603.22295
The Takeaway
The result suggests these models aren't simply keyword-matching for sentiment; they carry internal representations of situational emotional cues. That supports using LLMs for nuanced clinical and psychological analysis of text where explicit emotion keywords are absent.
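To make the distinction concrete, here is a minimal sketch (not from the paper; the keyword list and example sentences are illustrative) of the kind of detector the finding rules out: a keyword matcher fires on explicit emotion words but misses purely situational cues, which is exactly the gap the paper's stimuli are designed to probe.

```python
# Illustrative sketch only: a naive keyword-based "sentiment" detector.
# The keyword set and example sentences are hypothetical, chosen to show
# why keyword matching cannot explain affect detection in keyword-free text.
EMOTION_KEYWORDS = {"devastated", "thrilled", "furious", "heartbroken"}

def keyword_detects_emotion(text: str) -> bool:
    """Return True only if an explicit emotion keyword appears in the text."""
    tokens = {t.strip(".,!?\"'").lower() for t in text.split()}
    return bool(tokens & EMOTION_KEYWORDS)

# Explicit cue: the emotion is named outright.
explicit = "She was devastated when she opened the letter."
# Situational cue: the same affect is implied, but no emotion word appears.
situational = "She read the letter twice, then quietly packed her mother's things."

print(keyword_detects_emotion(explicit))     # keyword present
print(keyword_detects_emotion(situational))  # no keyword to match
```

A keyword matcher returns True for the first sentence and False for the second; a model whose internal "affect reception" circuits still respond to the second sentence is doing something beyond lexical lookup.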
From the abstract
Large language models appear to develop internal representations of emotion -- "emotion circuits," "emotion neurons," and structured emotional manifolds have been reported across multiple model families. But every study making these claims uses stimuli signalled by explicit emotion keywords, leaving a fundamental question unanswered: do these circuits detect genuine emotional meaning, or do they detect the word "devastated"? We present the first clinical validity test of emotion circuit claims u