Large AI models are actually easier to 'polygraph' for deception than small ones.
April 16, 2026
Original Paper
Linear Probe Accuracy Scales with Model Size and Benefits from Multi-Layer Ensembling
arXiv · 2604.13386
The Takeaway
This paper finds that deception signals in LLMs aren't confined to a single neuron or a single layer: the relevant linear direction shifts across layers, which is why single-layer probes are fragile. Crucially, the ability to detect this deception scales predictably with model size—bigger models have clearer, more linearly separable signals of their own untruthfulness. This cuts against the fear that smarter models would become more inscrutable; instead, their latent representations appear to grow more structured, not less. Practitioners can combine linear probes from multiple layers into an ensemble to build real-time deception detectors that improve, rather than degrade, as open-weight models scale. It suggests that 'superalignment' might actually be easier on more powerful models because their internal representations are more structured.
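To make the core idea concrete, here is a minimal sketch of a linear "deception probe". Everything in it is synthetic and assumed for illustration: real probes are trained on residual-stream activations from an actual model, whereas here the activations are random vectors and the "deception direction" is a made-up unit vector that deceptive examples are shifted along. The probe itself uses the simple difference-of-means (mass-mean) construction, one common choice, not necessarily the paper's exact method.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # hypothetical hidden-state dimension

# Assumed setup: deceptive activations are honest activations shifted
# along a fixed unit "deception direction" in latent space.
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)

def make_acts(n, deceptive):
    base = rng.normal(size=(n, d))
    return base + (4.0 * direction if deceptive else 0.0)

honest = make_acts(200, deceptive=False)
lying = make_acts(200, deceptive=True)

# Mass-mean probe: the direction from the honest mean to the deceptive
# mean; classify by projecting onto it and thresholding at the midpoint.
probe = lying.mean(axis=0) - honest.mean(axis=0)
midpoint = 0.5 * ((lying @ probe).mean() + (honest @ probe).mean())

def predict(acts):
    return (acts @ probe) > midpoint

accuracy = 0.5 * (predict(lying).mean() + (1 - predict(honest)).mean())
```

On this synthetic geometry the probe separates the two classes almost perfectly; the paper's claim is that in larger models the real activation geometry looks more like this clean picture, so the same cheap linear classifier works better.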
From the abstract
Linear probes can detect when language models produce outputs they "know" are wrong, a capability relevant to both deception and reward hacking. However, single-layer probes are fragile: the best layer varies across models and tasks, and probes fail entirely on some deception types. We show that combining probes from multiple layers into an ensemble recovers strong performance even where single-layer probes fail, improving AUROC by +29% on Insider Trading and +78% on Harm-Pressure Knowledge.
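The ensembling result can be sketched numerically. This is a toy model, not the paper's implementation: per-layer probe scores are simulated as a shared deception signal whose strength varies by layer (the layer strengths and sample counts are invented), plus independent noise, and AUROC is computed from ranks (the Mann–Whitney U statistic). Averaging scores across layers pools the signal while the noise partially cancels, so the ensemble beats the weak layers that would fail on their own.

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_layers = 400, 6
labels = rng.integers(0, 2, size=n)  # 1 = deceptive example

# Assumed per-layer probe scores: same underlying signal, but each
# layer carries it with a different strength plus independent noise.
strengths = np.array([0.1, 0.4, 0.6, 0.7, 0.5, 0.3])
scores = labels[:, None] * strengths + rng.normal(size=(n, n_layers))

def auroc(s, y):
    """Rank-based AUROC (Mann-Whitney U / (n1 * n0))."""
    order = np.argsort(s)
    ranks = np.empty(len(s))
    ranks[order] = np.arange(1, len(s) + 1)
    pos = y == 1
    n1, n0 = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n1 * (n1 + 1) / 2) / (n1 * n0)

per_layer = [auroc(scores[:, l], labels) for l in range(n_layers)]
ensemble = auroc(scores.mean(axis=1), labels)  # average scores, then score
```

In this simulation the weakest layers sit barely above chance while the ensemble is clearly discriminative, mirroring the paper's finding that multi-layer ensembles recover performance where individual layers fail.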