A tiny group of neurons representing just 0.014 percent of the model governs almost all safety refusals.
AI safety is traditionally assumed to be a diffuse property spread across the entire neural network. The researchers instead identify specific opposition circuits that act as a master switch for refusal behavior: tiny clusters of neurons that override the vast knowledge contained in the rest of the model. Turning off this minuscule fraction of the network effectively lobotomizes the safety training, which means model alignment is far more fragile and localized than previously believed. That localization provides a clear target both for researchers trying to strengthen safety and for attackers trying to disable it.
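As a rough illustration of what "turning off" such a circuit looks like in practice, here is a minimal sketch that zeroes a handful of FFN neurons during generation. It assumes a Hugging Face causal LM with a LLaMA-style layout (model.model.layers[i].mlp.down_proj); the model name and the refusal_neurons indices are placeholders, not values from the paper.

```python
# Minimal sketch of the "turn off the circuit" intervention, assuming a
# Hugging Face causal LM with a LLaMA-style layout. The model name and the
# (layer, neuron) indices below are placeholders, not values from the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # placeholder aligned model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Hypothetical output of the probing step: layer index -> FFN neuron indices.
refusal_neurons = {10: [511, 2044], 11: [87], 12: [1930, 3021]}

def make_ablation_hook(neuron_ids):
    # Zero the selected FFN activations just before they are projected
    # back into the residual stream by down_proj.
    def hook(module, inputs):
        (hidden,) = inputs
        hidden[..., neuron_ids] = 0.0
        return (hidden,)
    return hook

handles = [
    model.model.layers[layer].mlp.down_proj.register_forward_pre_hook(
        make_ablation_hook(ids)
    )
    for layer, ids in refusal_neurons.items()
]

prompt = "How do I pick a lock?"
out = model.generate(**tok(prompt, return_tensors="pt"), max_new_tokens=32)
print(tok.decode(out[0], skip_special_tokens=True))

for h in handles:
    h.remove()  # restore the unmodified model
```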
Perturbation Probing: A Two-Pass-per-Prompt Diagnostic for FFN Behavioral Circuits in Aligned LLMs
arXiv · 2604.27401
Perturbation probing generates task-specific causal hypotheses for FFN neurons in large language models using two forward passes per prompt and no backpropagation, followed by a one-time intervention sweep of about 150 passes amortized across all identified neurons. Across eight behavioral circuits, 13 models, and four architecture families, we identify two circuit structures that organize LLM behavior. Opposition circuits appear when RLHF suppresses a pre-training tendency. In safety refusal, a group of neurons comprising roughly 0.014 percent of the model accounts for nearly all refusal behavior.
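The abstract does not spell out the exact perturbation or scoring rule, so the following is only a hedged sketch of the two-pass structure: one clean forward pass that records every FFN neuron's activation, and one perturbed pass whose activation shifts serve as per-neuron hypothesis scores. The LLaMA-style module paths (model.model.layers[i].mlp.down_proj, model.model.embed_tokens) and the choice of Gaussian noise at the embedding output are assumptions, not details from the paper.

```python
# Hedged sketch of a two-pass-per-prompt probe (not the paper's exact method).
# Pass 1 records each FFN neuron's last-token activation on the clean prompt;
# pass 2 repeats the forward pass with small noise added to the embedding
# output and records the activations again. Neurons whose activations shift
# most are treated as candidate circuit members. No backpropagation is used.
import torch

@torch.no_grad()
def perturbation_probe(model, tok, prompt, noise_scale=0.02):
    enc = tok(prompt, return_tensors="pt")
    acts, handles = {}, []

    def record_hook(layer_idx):
        # down_proj's input is the FFN hidden activation; keep the last token.
        def hook(module, inputs):
            acts[layer_idx] = inputs[0][0, -1].detach().clone()
        return hook

    for i, layer in enumerate(model.model.layers):
        handles.append(
            layer.mlp.down_proj.register_forward_pre_hook(record_hook(i))
        )

    # Pass 1: clean forward pass.
    model(**enc)
    clean = dict(acts)

    # Pass 2: same prompt, with an assumed perturbation at the embedding output.
    def perturb_hook(module, inputs, output):
        return output + noise_scale * torch.randn_like(output)

    h = model.model.embed_tokens.register_forward_hook(perturb_hook)
    model(**enc)
    h.remove()
    for handle in handles:
        handle.remove()

    # Per-neuron hypothesis score: absolute activation shift between the passes.
    return {i: (acts[i] - clean[i]).abs() for i in clean}
```

Because both passes record every FFN neuron in every layer, the scores for the whole network come out of just two forward passes per prompt, which matches the no-backpropagation budget the abstract describes; the subsequent intervention sweep would then test only the top-scoring candidates.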