AI & ML New Capability

BenchBench shifts the focus from how well models answer benchmarks to how well they design them, by benchmarking automated benchmark generation.

March 24, 2026

Original Paper

BenchBench: Benchmarking Automated Benchmark Generation

Yandan Zheng, Haoran Luo, Zhenghong Lin, Wenjin Liu, Luu Anh Tuan

arXiv · 2603.20807

The Takeaway

As benchmarks saturate and suffer from contamination, the ability of LLMs to generate valid, discriminating evaluation suites becomes critical. This paper provides the first systematic dataset and pipeline for auditing the quality of LLM-generated benchmarks.

From the abstract

Benchmarks are the de facto standard for tracking progress in large language models (LLMs), yet static test sets can rapidly saturate, become vulnerable to contamination, and are costly to refresh. Scalable evaluation of open-ended items often relies on LLM judges, introducing additional sources of bias and prompt sensitivity. We argue that evaluation must extend beyond how well models answer benchmarks to how well models design them. We introduce BenchBench, a three-stage pipeline and dataset […]