A tiny four-kilobyte rulebook added to a database makes even average AI models perform as well as the best in the world.
AI performance on data analytics is usually attributed to the intelligence of the model itself. This benchmark shows that providing a small semantic layer of business definitions matters far more than model size: with that context, different frontier models perform almost identically on text-to-SQL tasks. Most errors trace back to a database schema that is too ambiguous for a machine to interpret without help. This means companies don't need smarter models to get accurate data insights; they need to document their data better. A little context goes a very long way in AI accuracy.
Semantic Layers for Reliable LLM-Powered Data Analytics: A Paired Benchmark of Accuracy and Hallucination Across Three Frontier Models
arXiv · 2604.25149
LLMs deployed for natural-language querying of analytical databases suffer from two intertwined failures, incorrect answers and confident hallucinations, both rooted in the same cause: the model is forced to infer business semantics that the schema does not encode. We test whether supplying those semantics as context closes the gap. We benchmark three frontier LLMs (Claude Opus 4.7, Claude Sonnet 4.6, GPT-5.4) on 100 natural-language questions over the Cleaned Contoso Retail Dataset in ClickHouse.