PaperBench: Evaluating AI’s Ability to Replicate AI Research

Giulio Starace, Oliver Jaffe, Dane Sherburn, James Aung, Chan Jun Shern, Leon Maksin, … (+7 more) — 2025-04-02 — OpenAI, OpenAI Preparedness Team — arXiv

Summary

Introduces PaperBench, a benchmark that evaluates AI agents' ability to replicate 20 ICML 2024 papers from scratch. It comprises 8,316 individually gradable tasks, organized into rubrics co-developed with each paper's authors, and uses an LLM-based judge to grade submissions automatically.
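The rubrics are hierarchical: fine-grained leaf requirements are graded pass/fail, and scores aggregate upward by weighted average. A minimal sketch of that aggregation scheme (the `RubricNode` structure and the example weights are illustrative assumptions, not the paper's actual schema):

```python
from dataclasses import dataclass, field

@dataclass
class RubricNode:
    """A node in a hierarchical grading rubric.

    Leaves hold a binary grade (0.0 or 1.0) assigned by the judge;
    internal nodes score as the weighted average of their children.
    """
    weight: float
    score: float = 0.0                                   # used only by leaves
    children: list["RubricNode"] = field(default_factory=list)

def replication_score(node: RubricNode) -> float:
    """Recursively aggregate leaf grades into a single score in [0, 1]."""
    if not node.children:
        return node.score
    total_weight = sum(c.weight for c in node.children)
    return sum(c.weight * replication_score(c) for c in node.children) / total_weight

# Hypothetical rubric: a code-execution group plus a standalone requirement.
root = RubricNode(weight=1.0, children=[
    RubricNode(weight=2.0, children=[
        RubricNode(weight=1.0, score=1.0),               # code runs end-to-end
        RubricNode(weight=1.0, score=0.0),               # result matches paper
    ]),
    RubricNode(weight=1.0, score=1.0),                   # method implemented
])
print(round(replication_score(root), 3))                 # ≈ 0.667
```

The weighted-tree design lets partial credit flow up from thousands of small checks, which is what makes 8,316 leaf tasks tractable for an automatic judge.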

Key Result

The best-performing agent, Claude 3.5 Sonnet (New), achieves an average replication score of 21.0%, which does not yet outperform a baseline of top ML PhDs.

Source