Large models "know" they are about to hallucinate before they emit the first token; small models show no such internal signal.
April 17, 2026
Original Paper
Before the First Token: Scale-dependent Emergence of Hallucination Signals in Autoregressive Language Models
SSRN · 6465859
The Takeaway
This study finds that once a model passes the 400M parameter threshold, it develops an internal representation of 'truthfulness.' A hallucination signal is detectable in the model's internal states before it starts generating text. Smaller models don't have this; they are effectively confabulating without any internal marker of their own errors. This means that for models above that scale, we could build 'pre-output' filters that catch hallucinations before the user ever sees them. It also suggests that self-awareness of truth is an emergent property of scale.
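To make the 'pre-output filter' idea concrete, here is a minimal sketch of how such a filter could work: a logistic probe reads the hidden state produced before the first output token and scores how likely the upcoming answer is to be a hallucination. Everything here is hypothetical, assumed for illustration: the probe weights would have to be trained offline on hidden states labeled factual vs. confabulated, and the paper does not prescribe this particular probe.

```python
import math

def hallucination_score(hidden_state, probe_weights, probe_bias):
    """Logistic probe over a pre-generation hidden state.

    hidden_state:  list of floats (the model's last hidden state for
                   the prompt, before the first token is sampled)
    probe_weights: list of floats, same length -- assumed to have been
                   fit offline on labeled factual/hallucinated examples
    Returns a score in (0, 1); higher means more likely to hallucinate.
    """
    logit = sum(h * w for h, w in zip(hidden_state, probe_weights)) + probe_bias
    return 1.0 / (1.0 + math.exp(-logit))

def pre_output_filter(hidden_state, probe_weights, probe_bias, threshold=0.8):
    """Return True if generation should be blocked before any token is shown."""
    return hallucination_score(hidden_state, probe_weights, probe_bias) >= threshold

# Toy demonstration with a hypothetical 4-dimensional hidden state.
probe_w = [0.9, -0.4, 0.7, 0.1]   # placeholder weights, not from the paper
probe_b = -0.2
state = [1.2, 0.3, 0.8, -0.5]

score = hallucination_score(state, probe_w, probe_b)
blocked = pre_output_filter(state, probe_w, probe_b, threshold=0.8)
```

In a real deployment the hidden state would come from a forward pass of the model itself, and the threshold would be tuned to trade off blocked hallucinations against suppressed correct answers.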
From the abstract
When do large language models make decisions to generate fake information? Doing so in fields such as health care, law, science research, or making financial decisions has serious consequences; but still there are few formal answers to this question. Recent studies have shown that there are differences in how autoregressive language models represent internally whether they are providing factual versus fictional responses, demonstrating that these models have some form of inte…