Mechanistic analysis reveals that over-refusal and harmful-intent refusal in LLMs occupy distinct representation subspaces.
March 31, 2026
Original Paper
Over-Refusal and Representation Subspaces: A Mechanistic Analysis of Task-Conditioned Refusal in Aligned LLMs
arXiv · 2603.27518
The Takeaway
The paper shows that while the representation of harmful intent is task-agnostic, over-refusal (e.g., refusing to 'kill' a computer process) is task-dependent and spans higher-dimensional clusters. This explains why global direction-ablation methods fail to fix over-refusal and indicates that task-specific interventions are required for reliable safety alignment.
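The global direction-ablation baseline the paper critiques can be sketched in a few lines. The toy example below (hypothetical data and dimensions, not the paper's code) computes a single "refusal direction" as the difference of mean activations between harmful and harmless prompts and projects it out of the hidden states; the paper's point is that because over-refusal occupies a higher-dimensional, task-dependent subspace, removing this one direction leaves much of the over-refusal behaviour intact.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # hypothetical hidden-state dimension

# Toy stand-ins for layer activations: harmful prompts shifted along one axis.
harmful_acts = rng.normal(size=(100, d)) + 2.0 * np.eye(d)[0]
harmless_acts = rng.normal(size=(100, d))

# Global "refusal direction": difference of means, normalised to unit length.
refusal_dir = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
refusal_dir /= np.linalg.norm(refusal_dir)

def ablate(h, direction):
    """Project the given direction out of each hidden-state row."""
    return h - np.outer(h @ direction, direction)

ablated = ablate(harmful_acts, refusal_dir)
# After ablation, activations have (near-)zero component along the direction,
# but any refusal-relevant variance in other directions is untouched.
print(np.abs(ablated @ refusal_dir).max())
```

A single projection like this only zeroes one axis of the representation space, which is why it can only "incidentally" correct over-refusal that happens to align with that axis.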
From the abstract
Aligned language models that are trained to refuse harmful requests also exhibit over-refusal: they decline safe instructions that seemingly resemble harmful instructions. A natural approach is to ablate the global refusal direction, steering the hidden-state vectors away from or towards the harmful-refusal examples, but this corrects over-refusal only incidentally while disrupting the broader refusal mechanism. In this work, we analyse the representational geometry of both refusal types to understand …