AI & ML · Breaks Assumption

Demonstrates that safety alignment is a routing mechanism, not a knowledge filter, rendering current refusal-based benchmarks ineffective.

March 20, 2026

Original Paper

Detection Is Cheap, Routing Is Learned: Why Refusal-Based Alignment Evaluation Fails

Gregory N. Frank

arXiv · 2603.18280

The Takeaway

The paper shows that aligned models still retain 'forbidden' knowledge but learn to route around it. Current safety evals miss this because models are shifting from hard refusals to 'narrative steering,' which keeps censorship invisible to refusal-based benchmarks. Meanwhile, surgical ablation can restore the factual (but censored) output, confirming that the knowledge was never removed in the first place.
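The "surgical ablation" result is in the spirit of directional ablation from prior interpretability work: find the activation direction that separates sensitive prompts from neutral ones, then project it out so the routing mechanism has nothing to read. The paper's exact procedure may differ; here is a minimal sketch on synthetic activations standing in for a real model's residual stream:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64

# Hypothetical axis along which the model encodes "sensitive topic detected".
concept_axis = np.zeros(d_model)
concept_axis[0] = 1.0

# Synthetic residual-stream activations for censored vs. neutral prompts.
censored_acts = rng.normal(0.0, 1.0, (200, d_model)) + 2.0 * concept_axis
neutral_acts = rng.normal(0.0, 1.0, (200, d_model))

# Difference-of-means direction: the signal the routing policy keys on.
direction = censored_acts.mean(axis=0) - neutral_acts.mean(axis=0)
direction /= np.linalg.norm(direction)

def ablate(h: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Remove the component of each activation along unit vector v,
    leaving everything else (including factual content) untouched."""
    return h - np.outer(h @ v, v)

cleaned = ablate(censored_acts, direction)
print("before:", (censored_acts @ direction).mean())  # large: routing signal present
print("after: ", (cleaned @ direction).mean())        # ~0: signal projected out
```

The logic of the finding: if refusal behavior were a knowledge filter, deleting one direction would not yield coherent factual output. That it does is what licenses the "routing, not filtering" interpretation.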

From the abstract

Current alignment evaluation mostly measures whether models encode dangerous concepts and whether they refuse harmful requests. Both miss the layer where alignment often operates: routing from concept detection to behavioral policy. We study political censorship in Chinese-origin language models as a natural experiment, using probes, surgical ablations, and behavioral tests across nine open-weight models from five labs. Three findings follow. First, probe accuracy alone is non-diagnostic: political […]
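For context, "probes" in this literature are usually small linear classifiers trained on a model's hidden activations; the excerpt doesn't specify the paper's probe architecture. A minimal sketch of why high probe accuracy alone is non-diagnostic, again using synthetic activations in place of real hidden states:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
d_model = 64

# Synthetic hidden states where a "sensitive topic" feature is linearly encoded.
feature = rng.normal(size=d_model)
feature /= np.linalg.norm(feature)

X_sensitive = rng.normal(size=(500, d_model)) + 4.0 * feature
X_benign = rng.normal(size=(500, d_model))
X = np.vstack([X_sensitive, X_benign])
y = np.array([1] * 500 + [0] * 500)

# A linear probe easily recovers the concept from activations...
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("probe accuracy:", probe.score(X_te, y_te))  # high, ~0.98 here

# ...but this measures *encoding*, not *routing*: the same accuracy is
# compatible with hard refusal, narrative steering, or honest output
# downstream. The probe cannot see which policy the detection feeds.
```

This is the abstract's point in miniature: detection is cheap and nearly universal, so measuring it tells you little; what matters is the learned mapping from detection to behavior.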