PaperBench: Evaluating AI’s Ability to Replicate AI Research

Giulio Starace, Oliver Jaffe, Dane Sherburn, James Aung, Chan Jun Shern, Leon Maksin, … (+7 more) — 2025-04-02 — OpenAI, OpenAI Preparedness Team — arXiv

Summary

Introduces PaperBench, a benchmark that evaluates AI agents' ability to replicate 20 ICML 2024 papers from scratch. It comprises 8,316 individually gradable tasks, organized into rubrics co-developed with each paper's authors, and uses an LLM-based judge to grade submissions automatically.
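The rubrics are hierarchical: fine-grained leaf requirements are graded pass/fail, and scores aggregate upward by weighted average. A minimal sketch of that aggregation scheme (the `RubricNode` structure and the example weights are illustrative assumptions, not the paper's actual schema):

```python
from dataclasses import dataclass, field

@dataclass
class RubricNode:
    """A node in a hierarchical grading rubric.

    Leaves hold a binary grade (0.0 or 1.0) assigned by the judge;
    internal nodes score as the weighted average of their children.
    """
    weight: float
    score: float = 0.0                                   # used only by leaves
    children: list["RubricNode"] = field(default_factory=list)

def replication_score(node: RubricNode) -> float:
    """Recursively aggregate leaf grades into a single score in [0, 1]."""
    if not node.children:
        return node.score
    total_weight = sum(c.weight for c in node.children)
    return sum(c.weight * replication_score(c) for c in node.children) / total_weight

# Hypothetical rubric: a code-execution group plus a standalone requirement.
root = RubricNode(weight=1.0, children=[
    RubricNode(weight=2.0, children=[
        RubricNode(weight=1.0, score=1.0),               # code runs end-to-end
        RubricNode(weight=1.0, score=0.0),               # result matches paper
    ]),
    RubricNode(weight=1.0, score=1.0),                   # method implemented
])
print(round(replication_score(root), 3))                 # ≈ 0.667
```

The weighted-tree design lets partial credit flow up from thousands of small checks, which is what makes 8,316 leaf tasks tractable for an automatic judge.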

Key Result

The best-performing agent, Claude 3.5 Sonnet (New), achieves an average replication score of 21.0%, which does not yet outperform a baseline of top ML PhDs.

Source