PRBench: End-to-end Paper Reproduction in Physics Research
The paper introduces PRBench, a benchmark of 30 expert-curated physics tasks for evaluating how well AI agents reproduce papers end to end. The evaluation shows that current models, despite their advanced reasoning abilities, struggle significantly with code correctness and data accuracy, and rarely achieve a successful reproduction.