RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human experts

Hjalmar Wijk, Tao Lin, Joel Becker, Sami Jawhar, Neev Parikh, Thomas Broadley, … (+17 more) — 2024-11-22 — METR — arXiv

Summary

Introduces RE-Bench, a benchmark of 7 challenging ML research engineering environments with baseline data from 71 eight-hour attempts by human experts, evaluating whether frontier AI agents can match or exceed human R&D capabilities.

Key Result

AI agents score 4x higher than human experts under a 2-hour time budget, but humans show better returns to additional time and overtake the agents at longer budgets, achieving 2x the best agent score at 32 hours.

Source