AI & ML Nature Is Weird

Two distinct populations of internal features drive how an LLM handles being wrong versus being unsure.

April 23, 2026

Original Paper

Are LLM Uncertainty and Correctness Encoded by the Same Features? A Functional Dissociation via Sparse Autoencoders

Het Patel, Tiejin Chen, Hua Wei, Evangelos E. Papalexakis, Jia Chen

arXiv · 2604.19974

The Takeaway

Sparse autoencoders reveal that a model's internal sense of uncertainty is mechanistically separate from its actual correctness: the two are driven by largely distinct feature populations. Some features, however, are confounded, mixing signals of doubt with signals of inaccuracy in a way that degrades performance. Suppressing these specific confounded features through targeted intervention makes the model significantly more accurate. This functional dissociation shows that an AI can be confident and wrong for specific structural reasons, and it gives practitioners a handle for cleaning up these internal signals to build models that are both smarter and more honest about their own limits.
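
To make the intervention concrete, here is a minimal sketch of what suppressing confounded SAE features could look like at inference time. Everything here is illustrative: the layer index, the feature indices, and the sae.encode/sae.decode interface are hypothetical placeholders, not the paper's actual setup.

```python
import torch

# Hypothetical setup (illustrative names, not the paper's code):
#   model          -- a HuggingFace Llama-style causal LM
#   sae            -- a sparse autoencoder trained on this layer's residual stream
#   LAYER          -- the layer whose activations the SAE reconstructs
#   CONFOUNDED_IDS -- SAE feature indices flagged as mixing doubt with error
LAYER = 20
CONFOUNDED_IDS = [113, 942, 5071]  # placeholder indices

def suppress_confounded(module, inputs, output):
    """Zero the confounded SAE features and patch the layer output in place."""
    hidden = output[0] if isinstance(output, tuple) else output
    acts = sae.encode(hidden)            # (batch, seq, n_features)
    error = hidden - sae.decode(acts)    # part of the signal the SAE misses
    acts[..., CONFOUNDED_IDS] = 0.0      # targeted suppression
    patched = sae.decode(acts) + error   # re-decode, keep the reconstruction error
    return (patched,) + output[1:] if isinstance(output, tuple) else patched

# Register the hook for the duration of an evaluation run, then remove it.
handle = model.model.layers[LAYER].register_forward_hook(suppress_confounded)
# ... run the accuracy benchmark here ...
handle.remove()
```

Adding back the SAE's reconstruction error is a common precaution in this kind of ablation: it ensures the intervention only removes the targeted features rather than everything the autoencoder fails to reconstruct.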

From the abstract

Large language models can be uncertain yet correct, or confident yet wrong, raising the question of whether their output-level uncertainty and their actual correctness are driven by the same internal mechanisms or by distinct feature populations. We introduce a 2x2 framework that partitions model predictions along correctness and confidence axes, and uses sparse autoencoders to identify features associated with each dimension independently. Applying this to Llama-3.1-8B and Gemma-2-9B, we identify …
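
The 2x2 partition itself is easy to reproduce on any evaluation log. Here is a minimal sketch, assuming correctness is a boolean from an exact-match check and confidence is a scalar such as the answer token's probability; the 0.5 threshold is an arbitrary assumption for illustration, not the paper's choice.

```python
from collections import Counter

def quadrant(is_correct: bool, confidence: float, threshold: float = 0.5) -> str:
    """Assign one prediction to a cell of the 2x2 correctness-confidence grid."""
    c = "correct" if is_correct else "wrong"
    u = "confident" if confidence >= threshold else "uncertain"
    return f"{c}/{u}"

# Toy records: (model answered correctly?, model's output-level confidence)
records = [(True, 0.91), (True, 0.32), (False, 0.88), (False, 0.12)]
counts = Counter(quadrant(ok, conf) for ok, conf in records)
print(counts)
# Counter({'correct/confident': 1, 'correct/uncertain': 1,
#          'wrong/confident': 1, 'wrong/uncertain': 1})
```

The interesting cells are the off-diagonal ones: "wrong/confident" and "correct/uncertain" are exactly the cases where uncertainty and correctness come apart, which is where the paper looks for dissociated features.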