Economics Paradigm Challenge

AI models are failing 'elite' tests because some of the test questions themselves are impossible to answer correctly.

March 26, 2026

Original Paper

The Arrival of AGI? When Expert Personas Exceed Expert Benchmarks

Drake Mullens, Stella Shen

SSRN · 6343278

The Takeaway

The paper argues that AI performance isn't just a matter of a model's intelligence, but also of systemic flaws in the gold-standard tests we use to measure it. High-performing models can actually be penalized for correct reasoning that contradicts a benchmark's flawed 'ground truth'.
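The penalty mechanism is easy to see with a toy example (an illustration of exact-match scoring in general, not code or data from the paper): under verbatim answer-key grading, a model that reasons correctly on every item still loses points wherever the key itself is mis-keyed.

```python
# Illustrative sketch: exact-match benchmark scoring marks a model "wrong"
# whenever its answer disagrees with the answer key, even when the key
# itself is flawed. All items below are hypothetical.

def exact_match_score(model_answers, answer_key):
    """Fraction of answers that match the key verbatim."""
    matches = sum(m == k for m, k in zip(model_answers, answer_key))
    return matches / len(answer_key)

# Hypothetical four-item quiz: the true answers vs. a key with one wrong entry.
true_answers = ["B", "C", "A", "D"]
flawed_key   = ["B", "C", "A", "A"]   # last entry mis-keyed

# A model that answers every item correctly...
model = list(true_answers)

print(exact_match_score(model, flawed_key))    # 0.75: penalized for being right
print(exact_match_score(model, true_answers))  # 1.0 under a corrected key
```

Under the flawed key, the model's ceiling is 75% no matter how well it reasons, which is exactly the kind of structural cap the paper attributes to benchmark 'ground truth' errors.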

From the abstract

Do expert personas improve language model performance? The Wharton Generative AI Lab reports that they do not, broadcasting to millions via social media the recommendation that practitioners abandon a technique recommended by Anthropic, Google, and OpenAI. We demonstrate that this null finding was structurally predictable. Five core mechanisms precluded detection before data collection began: baseline contamination elevating the starting point to near-ceiling, system prompt hierarchy subordinating …
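The "near-ceiling baseline" mechanism from the abstract can be illustrated with a toy calculation (an assumption for illustration, using made-up numbers rather than the paper's data): when a contaminated baseline already sits near the maximum score, most of any true improvement from an intervention such as an expert persona is truncated and cannot show up in the measurement.

```python
# Illustrative sketch: the gain an experiment can observe is capped by the
# headroom between the baseline accuracy and the scoring ceiling.
# The baseline and effect-size numbers below are hypothetical.

def observed_gain(baseline, true_effect, ceiling=1.0):
    """Measurable improvement after truncation at the ceiling."""
    return min(baseline + true_effect, ceiling) - baseline

# The same hypothetical +5-point true effect, at two baselines:
print(observed_gain(0.60, 0.05))  # fully visible (~0.05)
print(observed_gain(0.97, 0.05))  # truncated by the ceiling (~0.03)
```

With a contaminated baseline at 97%, even a genuine effect is squeezed into the remaining three points of headroom, making a null result the predictable outcome.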