A tiny group of neurons representing just 0.014 percent of the model governs almost all safety refusals.
AI safety is traditionally assumed to be a diffuse property spread across the entire neural network. The researchers instead identify specific opposition circuits that act as a master switch for refusal behavior: tiny clusters of neurons that override the vast knowledge contained in the rest of the model. Turning off this minuscule fraction of the network effectively lobotomizes the safety training, which means model alignment is far more fragile and localized than previously believed. That localization provides a clear target both for researchers trying to strengthen safety and for attackers trying to disable it.
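As a rough illustration of what "turning off" such a circuit looks like in practice, here is a minimal sketch that zeroes a handful of FFN neurons during generation. It assumes a Hugging Face causal LM with a LLaMA-style layout (model.model.layers[i].mlp.down_proj); the model name and the refusal_neurons indices are placeholders, not values from the paper.

```python
# Minimal sketch of the "turn off the circuit" intervention, assuming a
# Hugging Face causal LM with a LLaMA-style layout. The model name and the
# (layer, neuron) indices below are placeholders, not values from the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # placeholder aligned model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Hypothetical output of the probing step: layer index -> FFN neuron indices.
refusal_neurons = {10: [511, 2044], 11: [87], 12: [1930, 3021]}

def make_ablation_hook(neuron_ids):
    # Zero the selected FFN activations just before they are projected
    # back into the residual stream by down_proj.
    def hook(module, inputs):
        (hidden,) = inputs
        hidden[..., neuron_ids] = 0.0
        return (hidden,)
    return hook

handles = [
    model.model.layers[layer].mlp.down_proj.register_forward_pre_hook(
        make_ablation_hook(ids)
    )
    for layer, ids in refusal_neurons.items()
]

prompt = "How do I pick a lock?"
out = model.generate(**tok(prompt, return_tensors="pt"), max_new_tokens=32)
print(tok.decode(out[0], skip_special_tokens=True))

for h in handles:
    h.remove()  # restore the unmodified model
```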
Perturbation Probing: A Two-Pass-per-Prompt Diagnostic for FFN Behavioral Circuits in Aligned LLMs
arXiv · 2604.27401
Perturbation probing generates task-specific causal hypotheses for FFN neurons in large language models using two forward passes per prompt and no backpropagation, followed by a one-time intervention sweep of about 150 passes amortized across all identified neurons. Across eight behavioral circuits, 13 models, and four architecture families, we identify two circuit structures that organize LLM behavior. Opposition circuits appear when RLHF suppresses a pre-training tendency. In safety refusal, a group of neurons comprising roughly 0.014 percent of the model accounts for nearly all refusal behavior.
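The abstract does not spell out the exact perturbation or scoring rule, so the following is only a hedged sketch of the two-pass structure: one clean forward pass that records every FFN neuron's activation, and one perturbed pass whose activation shifts serve as per-neuron hypothesis scores. The LLaMA-style module paths (model.model.layers[i].mlp.down_proj, model.model.embed_tokens) and the choice of Gaussian noise at the embedding output are assumptions, not details from the paper.

```python
# Hedged sketch of a two-pass-per-prompt probe (not the paper's exact method).
# Pass 1 records each FFN neuron's last-token activation on the clean prompt;
# pass 2 repeats the forward pass with small noise added to the embedding
# output and records the activations again. Neurons whose activations shift
# most are treated as candidate circuit members. No backpropagation is used.
import torch

@torch.no_grad()
def perturbation_probe(model, tok, prompt, noise_scale=0.02):
    enc = tok(prompt, return_tensors="pt")
    acts, handles = {}, []

    def record_hook(layer_idx):
        # down_proj's input is the FFN hidden activation; keep the last token.
        def hook(module, inputs):
            acts[layer_idx] = inputs[0][0, -1].detach().clone()
        return hook

    for i, layer in enumerate(model.model.layers):
        handles.append(
            layer.mlp.down_proj.register_forward_pre_hook(record_hook(i))
        )

    # Pass 1: clean forward pass.
    model(**enc)
    clean = dict(acts)

    # Pass 2: same prompt, with an assumed perturbation at the embedding output.
    def perturb_hook(module, inputs, output):
        return output + noise_scale * torch.randn_like(output)

    h = model.model.embed_tokens.register_forward_hook(perturb_hook)
    model(**enc)
    h.remove()
    for handle in handles:
        handle.remove()

    # Per-neuron hypothesis score: absolute activation shift between the passes.
    return {i: (acts[i] - clean[i]).abs() for i in clean}
```

Because both passes record every FFN neuron in every layer, the scores for the whole network come out of just two forward passes per prompt, which matches the no-backpropagation budget the abstract describes; the subsequent intervention sweep would then test only the top-scoring candidates.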