Whether, Not Which: Mechanistic Interpretability Reveals Dissociable Affect Reception and Emotion Categorization in LLMs
arXiv:2603.22295v1 Announce Type: new Abstract: Large language models appear to develop internal representations of emotion -- "emotion circuits," "emotion neurons," and structured emotional manifolds have been reported across multiple model families. But every study making these claims uses stimuli signalled by explicit emotion keywords, leaving a fundamental question unanswered: do these circuits detect genuine emotional meaning, or do they detect the word "devastated"? We present the first clinical validity test of emotion circuit claims using mechanistic interpretability methods grounded in clinical psychology -- clinical vignettes that evoke emotions through situational and behavioural cues alone, emotion keywords removed. Across six models (Llama-3.2-1B, Llama-3-8B, Gemma-2-9B; base and instruct variants), we apply four convergent mechanistic interpretability methods -- linear probing, causal activation patching, knockout experiments, and representational geometry -- and discover two dissociable emotion processing mechanisms. Affect reception -- detecting emotionally significant content -- operates with near-perfect accuracy (AUROC 1.000), consistent with early-layer saturation, and replicates across all six models. Emotion categorization -- mapping affect to specific emotion labels -- is partially keyword-dependent, dropping 1-7% without keywords and improving with scale. Causal activation patching confirms keyword-rich and keyword-free stimuli share representational space, transferring affective salience rather than emotion-category identity. These findings falsify the keyword-spotting hypothesis, establish a novel mechanistic dissociation, and introduce clinical stimulus methodology as a rigorous standard for testing emotion processing claims in large language models -- with direct implications for AI safety evaluation and alignment. All stimuli, code, and data are released for replication.
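The causal activation patching the abstract describes, caching an activation from a keyword-rich run and splicing it into a keyword-free run to see what transfers, can be sketched on a toy model. Everything below (the two-layer "model", its inputs, and the layer names) is an illustrative stand-in, not the paper's actual setup.

```python
# Minimal activation-patching sketch on a hypothetical two-layer "model".
# The functions and inputs are invented for illustration only.

def layer1(x):
    # Hypothetical early layer: imagine x[0] carries affective salience.
    return [x[0] * 2.0, x[1] + 1.0]

def layer2(h):
    # Hypothetical late layer: scalar readout of the hidden state.
    return h[0] - h[1]

def run(x, patch_hidden=None):
    """Forward pass; optionally overwrite the layer-1 activation."""
    h = layer1(x) if patch_hidden is None else patch_hidden
    return layer2(h)

keyword_rich = [1.0, 0.0]   # stand-in for a keyword-laden vignette
keyword_free = [0.0, 0.0]   # stand-in for a keyword-free vignette

# Cache the hidden state from the keyword-rich run...
cached = layer1(keyword_rich)
# ...and patch it into the keyword-free run.
clean = run(keyword_free)
patched = run(keyword_free, patch_hidden=cached)

# If the patched output moves toward the keyword-rich output, the
# patched activation causally carries the transferred information.
print(clean, patched, run(keyword_rich))
```

In the paper's real experiments this substitution happens inside a transformer's residual stream; the point of the sketch is only the cache-then-splice control flow that lets one ask what information an activation causally carries.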
Executive Summary
This article presents a groundbreaking empirical challenge to the prevailing hypothesis that large language models detect emotions via keyword recognition. Using mechanistic interpretability techniques grounded in clinical psychology, the authors replace explicit emotion keywords with situational and behavioral stimuli across six large language models. Their findings reveal two dissociable mechanisms: affect reception—accurately detecting emotionally salient content without keyword dependence—and emotion categorization—mapping affect to labels, which is partially keyword-dependent. The study falsifies the keyword-spotting hypothesis, introduces a novel mechanistic dissociation, and establishes a clinical stimulus methodology as a rigorous standard for evaluating emotion processing in LLMs. The implications extend beyond academic discourse into AI safety and alignment frameworks.
Key Points
- ▸ Discovery of dissociable emotion processing mechanisms (affect reception vs. categorization)
- ▸ Falsification of keyword-spotting hypothesis via clinical stimulus methodology
- ▸ Validation of mechanistic interpretability as a tool for rigorous AI evaluation
Merits
Scientific Rigor
The use of clinical vignettes and multiple interpretability methods (linear probing, causal patching, knockout, representational geometry) constitutes a high standard of empirical validation.
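As one illustration of the probing component, a difference-of-means linear probe with an AUROC readout can be sketched on synthetic "activations". The affect direction, dimensionality, and data below are all invented stand-ins for real hidden states, assumed only for the sketch.

```python
# Toy linear probe: difference-of-means direction plus AUROC readout.
# Synthetic data stands in for model activations; nothing here is the
# paper's actual dataset or probe.
import random

random.seed(0)
DIM = 16

def make_activation(emotional: bool):
    # Stand-in hidden-state vector: "emotional" inputs get a shift
    # along one hypothetical affect direction (dimension 0).
    vec = [random.gauss(0.0, 1.0) for _ in range(DIM)]
    if emotional:
        vec[0] += 2.0
    return vec

# Label 1 = affect-laden vignette, 0 = neutral control.
data = [(make_activation(True), 1) for _ in range(200)] + \
       [(make_activation(False), 0) for _ in range(200)]

def mean_vec(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]

# Difference-of-means probe: the direction between class centroids
# serves as the weight vector, so no training loop is needed.
pos_mean = mean_vec([x for x, y in data if y == 1])
neg_mean = mean_vec([x for x, y in data if y == 0])
w = [p - n for p, n in zip(pos_mean, neg_mean)]

def score(x):
    return sum(wi * xi for wi, xi in zip(w, x))

def auroc(pairs):
    # AUROC = probability a random positive outscores a random negative.
    pos = [score(x) for x, y in pairs if y == 1]
    neg = [score(x) for x, y in pairs if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(f"probe AUROC: {auroc(data):.3f}")
```

A real probe would be fit on held-out activations from the clinical vignettes; the sketch only shows why a linear direction plus a ranking metric like AUROC is a natural test for whether affect is linearly decodable.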
Demerits
Scope Constraint
The study focuses on specific model families and stimuli; broader generalization to other domains or architectures remains to be validated.
Expert Commentary
This paper represents a pivotal shift in the discourse around AI emotion processing. Historically, claims about 'emotion circuits' in LLMs were conflated with lexical recognition; this study disentangles the cognitive architecture of affective perception from semantic label recognition. The dissociation between affect reception (detecting contextual affective salience) and categorization (mapping affect to labels) has profound implications for both theoretical modeling of affect in neural networks and applied use cases, particularly clinical AI applications, content moderation, and ethical deployment. The clinical stimulus methodology introduced here should become a baseline for future research, and the authors' release of stimuli, code, and data enhances reproducibility and accelerates the path toward more robust AI governance. The findings also challenge assumptions underpinning many commercial AI systems that rely on emotion detection for user interaction or content curation; such systems may need recalibrated internal models and evaluation metrics under these new interpretability paradigms.
Recommendations
- ✓ 1. Incorporate clinical stimulus methodologies into standard AI evaluation frameworks for affective processing.
- ✓ 2. Encourage peer-reviewed replication studies using the released stimuli and code to validate findings across diverse model architectures and domains.
Sources
Original: arXiv - cs.CL