Large models "know" they are about to hallucinate before they emit the first token; small models show no such internal signal.
April 17, 2026
Original Paper
Before the First Token: Scale-dependent Emergence of Hallucination Signals in Autoregressive Language Models
SSRN · 6465859
The Takeaway
This study finds that once a model passes the 400M parameter threshold, it develops an internal representation of 'truthfulness.' A hallucination signal is detectable in the model's internal states before it starts generating text. Smaller models don't have this; they are effectively confabulating without any internal marker of their own errors. This means that for models above that scale, we could build 'pre-output' filters that catch hallucinations before the user ever sees them. It also suggests that self-awareness of truth is an emergent property of scale.
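To make the 'pre-output filter' idea concrete, here is a minimal sketch of how such a filter could work: a logistic probe reads the hidden state produced before the first output token and scores how likely the upcoming answer is to be a hallucination. Everything here is hypothetical, assumed for illustration: the probe weights would have to be trained offline on hidden states labeled factual vs. confabulated, and the paper does not prescribe this particular probe.

```python
import math

def hallucination_score(hidden_state, probe_weights, probe_bias):
    """Logistic probe over a pre-generation hidden state.

    hidden_state:  list of floats (the model's last hidden state for
                   the prompt, before the first token is sampled)
    probe_weights: list of floats, same length -- assumed to have been
                   fit offline on labeled factual/hallucinated examples
    Returns a score in (0, 1); higher means more likely to hallucinate.
    """
    logit = sum(h * w for h, w in zip(hidden_state, probe_weights)) + probe_bias
    return 1.0 / (1.0 + math.exp(-logit))

def pre_output_filter(hidden_state, probe_weights, probe_bias, threshold=0.8):
    """Return True if generation should be blocked before any token is shown."""
    return hallucination_score(hidden_state, probe_weights, probe_bias) >= threshold

# Toy demonstration with a hypothetical 4-dimensional hidden state.
probe_w = [0.9, -0.4, 0.7, 0.1]   # placeholder weights, not from the paper
probe_b = -0.2
state = [1.2, 0.3, 0.8, -0.5]

score = hallucination_score(state, probe_w, probe_b)
blocked = pre_output_filter(state, probe_w, probe_b, threshold=0.8)
```

In a real deployment the hidden state would come from a forward pass of the model itself, and the threshold would be tuned to trade off blocked hallucinations against suppressed correct answers.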
From the abstract
When do large language models make decisions to generate fake information? Doing so in fields such as health care, law, science research, or making financial decisions has serious consequences; but still there are few formal answers to this question. Recent studies have shown that there are differences in how autoregressive language models represent internally whether they are providing factual versus fictional responses, demonstrating that these models have some form of inte…