BenchBench shifts the focus from how well models answer benchmarks to how well they design them, by benchmarking automated benchmark generation itself.
March 24, 2026
Original Paper
BenchBench: Benchmarking Automated Benchmark Generation
arXiv · 2603.20807
The Takeaway
As benchmarks saturate and suffer from contamination, the ability of LLMs to generate valid, discriminating evaluation suites becomes critical. This paper provides the first systematic dataset and pipeline for auditing the quality of LLM-generated benchmarks.
From the abstract
Benchmarks are the de facto standard for tracking progress in large language models (LLMs), yet static test sets can rapidly saturate, become vulnerable to contamination, and are costly to refresh. Scalable evaluation of open-ended items often relies on LLM judges, introducing additional sources of bias and prompt sensitivity. We argue that evaluation must extend beyond how well models answer benchmarks to how well models design them. We introduce BenchBench, a three-stage pipeline and dataset for auditing the quality of LLM-generated benchmarks.
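To make the motivation concrete, here is a minimal sketch of the kind of audit the paper argues for: checking whether a generated test set still discriminates between models. This is not BenchBench's actual pipeline; the model names and the toy results grid are hypothetical, and the metrics (per-item difficulty, score spread) are just simple proxies for item validity and separation.

```python
# Minimal sketch (not the paper's pipeline): audit whether a generated
# benchmark still separates models. Names and the toy grid are hypothetical.
from statistics import mean, pstdev

# results[model][item] = 1 if the model answered the item correctly, else 0
results = {
    "model_a": [1, 1, 0, 1, 0, 1],
    "model_b": [1, 0, 0, 1, 0, 1],
    "model_c": [1, 0, 0, 0, 0, 1],
}

n_items = len(next(iter(results.values())))

# Per-item difficulty: fraction of models answering the item correctly.
difficulty = [mean(r[i] for r in results.values()) for i in range(n_items)]

# Items everyone solves (saturated) or no one solves separate no models.
non_discriminating = [i for i, d in enumerate(difficulty) if d in (0.0, 1.0)]

# Spread of overall model scores: a crude proxy for how well the suite
# ranks models apart (higher spread = more separation).
model_scores = {m: mean(r) for m, r in results.items()}
separation = pstdev(model_scores.values())

print(f"non-discriminating items: {non_discriminating}")
print(f"model scores: {model_scores}")
print(f"score spread (separation proxy): {separation:.3f}")
```

In practice such checks would run over many candidate items and a larger panel of reference models; the point is only that benchmark quality itself can be measured, which is the capability BenchBench sets out to benchmark.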