PRBench reveals that current top-tier coding agents achieve a 0% success rate on end-to-end physics paper reproduction.
March 31, 2026
Original Paper
PRBench: End-to-end Paper Reproduction in Physics Research
arXiv · 2603.27646
The Takeaway
This benchmark challenges the hype around 'AI for Science' by showing that even today's top-tier coding agents fail completely at the multi-step process of deriving formulas and implementing algorithms from real papers. It sets a necessary, high bar for the field of autonomous scientific discovery.
From the abstract
AI agents powered by large language models exhibit strong reasoning and problem-solving capabilities, enabling them to assist scientific research tasks such as formula derivation and code generation. However, whether these agents can reliably perform end-to-end reproduction from real scientific papers remains an open question. We introduce PRBench, a benchmark of 30 expert-curated tasks spanning 11 subfields of physics. Each task requires an agent to comprehend the methodology of a published pap