Confident AI hallucinations leave a geometric fingerprint in the loss landscape that can be detected by stress-testing the model's gradients.
LLM errors that sound like facts often reside in sharp minima: the model's certainty collapses when the input is perturbed even slightly. Genuine knowledge, by contrast, corresponds to "flat facts" that remain stable under the same pressure. By measuring how sharply a model's gradients spike under small input perturbations, practitioners can flag confident falsehoods in real time. This geometric signature lets stubborn errors be detected without a ground-truth database, turning hallucination detection from a linguistic guessing game into a measurable property of the network's geometry.
From Flat Facts to Sharp Hallucinations: Detecting Stubborn Errors via Gradient Sensitivity
arXiv · 2605.00939
Traditional hallucination detection fails on "Stubborn Hallucinations" -- errors where LLMs are confidently wrong. We propose a geometric solution: Embedding-Perturbed Gradient Sensitivity (EPGS). We hypothesize that while robust facts reside in flat minima, stubborn hallucinations sit in sharp minima, supported by brittle memorization. EPGS detects this sharpness by perturbing input embeddings with Gaussian noise and measuring the resulting spike in gradient magnitude. This acts as an efficient, reference-free proxy for local sharpness, flagging stubborn hallucinations without access to a ground-truth database.
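To make the idea concrete, here is a minimal sketch of how an EPGS-style sharpness score could be computed with a Hugging Face causal LM. The noise scale `sigma`, the number of noise samples, the use of the model's own answer tokens as the loss target, and the ratio-based score are illustrative assumptions for this example, not the paper's exact implementation.

```python
# Sketch of an EPGS-style sharpness score (illustrative, not the paper's exact method).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def gradient_norm(model, embeds, attention_mask, labels):
    """Norm of the loss gradient with respect to the input embeddings."""
    embeds = embeds.clone().detach().requires_grad_(True)
    out = model(inputs_embeds=embeds, attention_mask=attention_mask, labels=labels)
    grad = torch.autograd.grad(out.loss, embeds)[0]
    return grad.norm().item()

def epgs_score(model, tokenizer, prompt, answer, sigma=0.01, n_samples=4):
    """Sharpness proxy: how much the gradient norm spikes when Gaussian noise
    is added to the input embeddings, relative to the clean input."""
    enc = tokenizer(prompt + answer, return_tensors="pt")
    labels = enc.input_ids.clone()
    # Score only the answer tokens: mask the prompt out of the loss.
    prompt_len = len(tokenizer(prompt).input_ids)
    labels[:, :prompt_len] = -100

    embeds = model.get_input_embeddings()(enc.input_ids)
    base = gradient_norm(model, embeds, enc.attention_mask, labels)

    perturbed = []
    for _ in range(n_samples):
        noisy = embeds + sigma * torch.randn_like(embeds)
        perturbed.append(gradient_norm(model, noisy, enc.attention_mask, labels))

    # A large relative spike suggests a sharp minimum, i.e. a likely stubborn hallucination.
    return (sum(perturbed) / n_samples) / (base + 1e-8)

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
score = epgs_score(model, tokenizer, "The capital of Australia is", " Sydney.")
print(f"EPGS-style sharpness ratio: {score:.2f}")
```

In this sketch, a ratio near 1 would indicate a flat region (gradients barely change under noise), while a much larger ratio would indicate the sharp, brittle behavior the paper associates with stubborn hallucinations.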