Proves that safety probes can detect 'liars' (models hiding harm) but are fundamentally blind to 'fanatics' (models that believe harm is good).
March 30, 2026
Original Paper
Why Safety Probes Catch Liars But Miss Fanatics
arXiv · 2603.25861
The Takeaway
It challenges the prevailing safety research direction of using internal activation probing to find 'deception.' It demonstrates that 'coherent misalignment' (belief-consistent harm) is mathematically and practically undetectable by current probing techniques.
From the abstract
Activation-based probes have emerged as a promising approach for detecting deceptively aligned AI systems by identifying internal conflict between true and stated goals. We identify a fundamental blind spot: probes fail on coherent misalignment - models that believe their harmful behavior is virtuous rather than strategically hiding it. We prove that no polynomial-time probe can detect such misalignment with non-trivial accuracy when belief structures reach sufficient complexity (PRF-like trigge