Mechanistic analysis reveals that over-refusal and harmful-intent refusal in LLMs occupy distinct representation subspaces.
March 31, 2026
Original Paper
Over-Refusal and Representation Subspaces: A Mechanistic Analysis of Task-Conditioned Refusal in Aligned LLMs
arXiv · 2603.27518
The Takeaway
The paper shows that while the representation of harmful intent is task-agnostic, over-refusal (e.g., refusing to 'kill' a computer process) is task-dependent and spans higher-dimensional clusters. This explains why global direction-ablation methods fail to fix over-refusal and indicates that task-specific interventions are required for reliable safety alignment.
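The global direction-ablation baseline the paper critiques can be sketched in a few lines. The toy example below (hypothetical data and dimensions, not the paper's code) computes a single "refusal direction" as the difference of mean activations between harmful and harmless prompts and projects it out of the hidden states; the paper's point is that because over-refusal occupies a higher-dimensional, task-dependent subspace, removing this one direction leaves much of the over-refusal behaviour intact.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # hypothetical hidden-state dimension

# Toy stand-ins for layer activations: harmful prompts shifted along one axis.
harmful_acts = rng.normal(size=(100, d)) + 2.0 * np.eye(d)[0]
harmless_acts = rng.normal(size=(100, d))

# Global "refusal direction": difference of means, normalised to unit length.
refusal_dir = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
refusal_dir /= np.linalg.norm(refusal_dir)

def ablate(h, direction):
    """Project the given direction out of each hidden-state row."""
    return h - np.outer(h @ direction, direction)

ablated = ablate(harmful_acts, refusal_dir)
# After ablation, activations have (near-)zero component along the direction,
# but any refusal-relevant variance in other directions is untouched.
print(np.abs(ablated @ refusal_dir).max())
```

A single projection like this only zeroes one axis of the representation space, which is why it can only "incidentally" correct over-refusal that happens to align with that axis.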
From the abstract
Aligned language models that are trained to refuse harmful requests also exhibit over-refusal: they decline safe instructions that seemingly resemble harmful instructions. A natural approach is to ablate the global refusal direction, steering the hidden-state vectors away from or towards the harmful-refusal examples, but this corrects over-refusal only incidentally while disrupting the broader refusal mechanism. In this work, we analyse the representational geometry of both refusal types to understand …