RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human experts
Hjalmar Wijk, Tao Lin, Joel Becker, Sami Jawhar, Neev Parikh, Thomas Broadley, … (+17 more) — 2024-11-22 — METR — arXiv
Summary
Introduces RE-Bench, a benchmark of 7 challenging ML research engineering environments paired with human expert baselines from 71 attempts, used to evaluate whether frontier AI agents can match or exceed human R&D capabilities.
Key Result
AI agents score 4x higher than human experts under a 2-hour time budget, but humans show better returns to additional time and narrow the gap at longer budgets, scoring 2x higher than agents at 32 hours.
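To make the time-budget comparison concrete, below is a minimal sketch, not the paper's actual scoring code, of one way to aggregate normalized per-run scores into a best-of-k score at a matched total time budget and compare agents to humans. The run scores, budgets, and the `best_of_k` helper are illustrative assumptions, not values or functions from RE-Bench.

```python
from statistics import mean
from itertools import combinations


def best_of_k(scores, k):
    """Average best-of-k score: mean of the max over every size-k subset of runs.

    Illustrative aggregation, assuming scores are already normalized per
    environment (e.g. 0 = starting solution, 1 = a strong reference solution).
    """
    return mean(max(subset) for subset in combinations(scores, k))


# Hypothetical normalized scores for independent runs on one environment.
agent_2h_runs = [0.3, 0.5, 0.1, 0.4]   # four 2-hour agent runs (8h total)
human_8h_runs = [0.2]                  # one 8-hour human attempt

# Compare at a matched 8-hour total budget: best-of-4 agent runs vs. one human run.
agent_score = best_of_k(agent_2h_runs, k=4)
human_score = best_of_k(human_8h_runs, k=1)
print(f"agent best-of-4 (2h each): {agent_score:.2f}")
print(f"human single 8h attempt:  {human_score:.2f}")
print(f"agent/human ratio:        {agent_score / human_score:.1f}x")
```

Under this kind of matched-budget aggregation, giving agents many short attempts rewards their cheap sampling, while humans benefit more from a single long attempt, which is the qualitative pattern the key result describes.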
Source
- Link: https://arxiv.org/abs/2411.15114
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- autonomy-evals — Evals
- Editorial blurb (verbatim):
  [RE-Bench: Evaluating frontier AI R\&D capabilities of language model agents against human experts](https://arxiv.org/abs/2411.15114)