PaperBench: Evaluating AI’s Ability to Replicate AI Research
Giulio Starace, Oliver Jaffe, Dane Sherburn, James Aung, Chan Jun Shern, Leon Maksin, … (+7 more) — 2025-04-02 — OpenAI, OpenAI Preparedness Team — arXiv
Summary
Introduces PaperBench, a benchmark evaluating AI agents’ ability to replicate 20 ICML 2024 Spotlight and Oral papers from scratch. It comprises 8,316 individually gradable tasks, with grading rubrics co-developed with each paper’s authors and an LLM-based judge for automatic evaluation.
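The paper describes each rubric as a tree: leaves are binary pass/fail criteria (graded by the LLM judge), and a paper’s replication score is a weighted aggregate over the tree. A minimal sketch of that aggregation, with illustrative names (not from the PaperBench repo) and the judge replaced by a pre-set pass/fail flag:

```python
# Hedged sketch of hierarchical rubric scoring as described in PaperBench:
# leaves are binary criteria, parent scores are weighted averages of children.
# Class/field names are illustrative assumptions, not the actual repo API.
from dataclasses import dataclass, field
from typing import List

@dataclass
class RubricNode:
    name: str
    weight: float = 1.0
    passed: bool = False                     # meaningful only for leaves
    children: List["RubricNode"] = field(default_factory=list)

def score(node: RubricNode) -> float:
    """Return the replication score in [0, 1] for this subtree."""
    if not node.children:                    # leaf: binary pass/fail
        return 1.0 if node.passed else 0.0
    total = sum(c.weight for c in node.children)
    return sum(c.weight * score(c) for c in node.children) / total

# Toy example: two graded leaves, one passed
root = RubricNode("paper", children=[
    RubricNode("code-dev", weight=2.0, passed=True),
    RubricNode("results-match", weight=1.0, passed=False),
])
print(score(root))  # prints 0.6666666666666666
```

The weighted-average recursion means partial credit propagates upward, so an agent that reproduces the code but not the final results still earns a nonzero score.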
Key Result
Best-performing agent (Claude 3.5 Sonnet (New)) achieves a 21.0% average replication score, not yet outperforming the top ML PhD baseline.
Source
- Link: https://t.co/dHN2N0tUhC
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- autonomy-evals — Evals