AI & ML Breaks Assumption

Proves that safety probes can detect 'liars' (models hiding harm) but are fundamentally blind to 'fanatics' (models that believe harm is good).

March 30, 2026

Original Paper

Why Safety Probes Catch Liars But Miss Fanatics

Kristiyan Haralambiev

arXiv · 2603.25861

The Takeaway

It challenges the prevailing safety research direction of using internal activation probing to find 'deception.' It demonstrates that 'coherent misalignment' (belief-consistent harm) is mathematically and practically undetectable by current probing techniques.

From the abstract

Activation-based probes have emerged as a promising approach for detecting deceptively aligned AI systems by identifying internal conflict between true and stated goals. We identify a fundamental blind spot: probes fail on coherent misalignment - models that believe their harmful behavior is virtuous rather than strategically hiding it. We prove that no polynomial-time probe can detect such misalignment with non-trivial accuracy when belief structures reach sufficient complexity (PRF-like trigge

Read the original paper →

← Back to today's papers