Shows that 'topic-matched' contrast pairs are ineffective for extracting refusal directions in LLM abliteration.
March 24, 2026
Original Paper
On the Failure of Topic-Matched Contrast Baselines in Multi-Directional Refusal Abliteration
arXiv · 2603.22061
The Takeaway
The paper challenges the conventional wisdom that contrasting harmful prompts with harmless prompts on similar topics is the best way to steer model behavior. The finding pushes researchers to rethink how they isolate and remove safety-related features from model weights.
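To make the role of the contrast baseline concrete, here is a minimal sketch of one common way a refusal direction is extracted and removed: take the difference of mean activations between harmful and baseline prompts, normalize it, and project it out. This is an illustration under stated assumptions, not the paper's method. The function names (`extract_direction`, `ablate`) are hypothetical, and the activations are synthetic random vectors standing in for residual stream activations.

```python
import random

def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def unit(v):
    """Scale a vector to unit Euclidean length."""
    norm = sum(x * x for x in v) ** 0.5
    return [x / norm for x in v]

def extract_direction(harmful_acts, baseline_acts):
    """Difference-of-means direction (assumed technique), unit-normalized."""
    mu_h = mean_vector(harmful_acts)
    mu_b = mean_vector(baseline_acts)
    return unit([h - b for h, b in zip(mu_h, mu_b)])

def ablate(activation, direction):
    """Remove the component of `activation` along the unit `direction`."""
    coeff = sum(a * d for a, d in zip(activation, direction))
    return [a - coeff * d for a, d in zip(activation, direction)]

# Synthetic stand-in data (assumption): harmful activations are shifted
# along axis 0 relative to the baseline activations.
rng = random.Random(0)
dim = 8
baseline_acts = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(50)]
harmful_acts = [
    [rng.gauss(0, 1) + (3.0 if i == 0 else 0.0) for i in range(dim)]
    for _ in range(50)
]

direction = extract_direction(harmful_acts, baseline_acts)
cleaned = ablate(harmful_acts[0], direction)

# After ablation, the activation has no component left along the direction.
residual = sum(c * d for c, d in zip(cleaned, direction))
```

In this sketch the baseline enters only through its mean, so swapping in a different set of baseline prompts changes the extracted direction and therefore what gets removed. That is why the choice of contrast baseline is a methodological question rather than an implementation detail.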
From the abstract
Inasmuch as the removal of refusal behavior from instruction-tuned language models by directional abliteration requires the extraction of refusal-mediating directions from the residual stream activation space, and inasmuch as the construction of the contrast baseline against which harmful prompt activations are compared has been treated in the existing literature as an implementation detail rather than a methodological concern, the present work investigates whether a topically matched contrast b