LLMpedia exposes a large gap in LLM factuality by generating ~1M articles purely from parametric memory, revealing that actual knowledge retrieval runs more than 15 percentage points below what multiple-choice benchmarks suggest.
March 26, 2026
Original Paper
LLMpedia: A Transparent Framework to Materialize an LLM's Encyclopedic Knowledge at Scale
arXiv · 2603.24080
The Takeaway
The results challenge the 'factuality saturation' suggested by high MMLU scores, showing that frontier models still fail significantly when forced to generate content rather than select it. The project also provides the first fully open parametric encyclopedia, bridging the gap between benchmark evaluation and real-world knowledge materialization.
From the abstract
Benchmarks such as MMLU suggest flagship language models approach factuality saturation, with scores above 90%. We show this picture is incomplete. LLMpedia generates encyclopedic articles entirely from parametric memory, producing ~1M articles across three model families without retrieval. For gpt-5-mini, the verifiable true rate on Wikipedia-covered subjects is only 74.7%, more than 15 percentage points below the benchmark-based picture, consistent with the availability bias…
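As a toy illustration of the metric behind that gap, the sketch below computes a verifiable true rate over fact-checked claims and compares it to a multiple-choice benchmark score. The claim data and the `verifiable_true_rate` helper are hypothetical, chosen only to show the shape of the comparison; this is not the paper's actual pipeline or data.

```python
def verifiable_true_rate(claims):
    """Fraction of verifiable claims judged true; unverifiable claims are excluded."""
    verifiable = [c for c in claims if c["verifiable"]]
    if not verifiable:
        return 0.0
    return sum(c["true"] for c in verifiable) / len(verifiable)

# Toy claim set extracted from one generated article (illustrative only).
claims = [
    {"verifiable": True, "true": True},
    {"verifiable": True, "true": True},
    {"verifiable": True, "true": False},
    {"verifiable": False, "true": False},  # opinion / unverifiable, excluded
    {"verifiable": True, "true": True},
]

rate = verifiable_true_rate(claims)
benchmark_score = 0.90  # e.g. an MMLU-style multiple-choice accuracy
gap_pp = (benchmark_score - rate) * 100  # gap in percentage points

print(f"verifiable true rate: {rate:.1%}, gap: {gap_pp:.1f} pp")
```

The key design point, as described in the abstract, is that generation-based scoring only counts claims that can be checked against an external source, which is a much stricter test than selecting among pre-written answer options.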