AI & ML · Nature Is Weird

Large models 'know' when they are lying about facts, but they are genuinely oblivious to their own errors in mathematical logic.

April 17, 2026

Original Paper

Masked by Consensus: Disentangling Privileged Knowledge in LLM Correctness

arXiv · 2604.12373

The Takeaway

The paper finds that factual correctness is encoded in a model's internal states (in effect, the model 'knows' when it is hallucinating a fact), but reasoning errors leave no such trace: the model is blind to its own logic failures in math. This matters for AI safety and truthfulness work, because it means we can build internal 'truth-detectors' for factual claims, but we cannot rely on the model to notice or self-correct flawed reasoning. Logic appears to require external verification, while factual recall mainly needs better internal monitoring. That distinction is important for anyone building 'honest' agents or automated fact-checkers.
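To make the 'truth-detector' idea concrete, here is a minimal sketch of a linear probe on hidden states that predicts whether a model's factual answer was correct. This is not the paper's code: the model name (gpt2), the probed layer, the mean-pooling, and the toy labelled questions are all assumptions, and a real experiment would use thousands of graded QA pairs and a held-out test set.

```python
# A minimal sketch (not the paper's code) of a hidden-state "truth detector":
# probe a model's own activations to predict whether its factual answer is correct.
# The model name, probed layer, pooling, and toy labels are all assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

MODEL = "gpt2"   # small stand-in; the paper studies larger LLMs
LAYER = -1       # which hidden layer to probe is a hyperparameter

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, output_hidden_states=True)
model.eval()

def question_representation(text: str) -> torch.Tensor:
    """Mean-pool one layer's hidden states over the prompt tokens."""
    inputs = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    return out.hidden_states[LAYER].mean(dim=1).squeeze(0)

# Toy labelled data: (question, was the model's answer factually correct?)
examples = [
    ("What is the capital of France?", 1),
    ("Who wrote 'Pride and Prejudice'?", 1),
    ("What year did humans first land on Saturn?", 0),
    ("Name the largest prime number.", 0),
]

X = torch.stack([question_representation(q) for q, _ in examples]).numpy()
y = [label for _, label in examples]

# Fit a linear probe; a real experiment would evaluate on held-out questions.
probe = LogisticRegression(max_iter=1000).fit(X, y)
print("training accuracy:", probe.score(X, y))
```

If a probe like this performs well on held-out questions, the correctness signal is linearly decodable from the model's own hidden states; per the takeaway above, that appears to hold for factual questions but not for reasoning errors.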

From the abstract

Humans use introspection to evaluate their understanding through private internal states inaccessible to external observers. We investigate whether large language models possess similar privileged knowledge about answer correctness, information unavailable through external observation. We train correctness classifiers on question representations from both a model's own hidden states and external models, testing whether self-representations provide a performance advantage. On standard evaluation, …
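The abstract's self-versus-external comparison can be summarised with an equally small sketch: train identical probes on question representations from the answering model and from a separate observer model, then compare accuracy. The features below are random placeholders standing in for pooled hidden states (extracted as in the sketch above), so the numbers are meaningless; the point is the protocol, not the result.

```python
# Sketch of the comparison implied by the abstract (assumed setup, not the
# paper's protocol): identical linear probes on "self" vs "external"
# question representations, scored by cross-validated accuracy.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_questions, dim = 200, 64
labels = rng.integers(0, 2, size=n_questions)   # 1 = answer was correct

# Placeholders: in practice these are hidden-state features of each question,
# taken from the answering model itself and from an external observer model.
reps_self = rng.normal(size=(n_questions, dim))
reps_external = rng.normal(size=(n_questions, dim))

def probe_accuracy(features: np.ndarray) -> float:
    """Cross-validated accuracy of a linear probe on the given features."""
    return cross_val_score(LogisticRegression(max_iter=1000),
                           features, labels, cv=5).mean()

# A consistent gap in favour of the self-representations would be evidence
# of "privileged knowledge" unavailable to an external observer.
print("self probe:    ", probe_accuracy(reps_self))
print("external probe:", probe_accuracy(reps_external))
```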