RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human experts
Hjalmar Wijk, Tao Lin, Joel Becker, Sami Jawhar, Neev Parikh, Thomas Broadley, … (+17 more) — 2024-11-22 — METR — arXiv
Summary
Introduces RE-Bench, a benchmark of 7 challenging ML research engineering environments paired with human expert baselines from 71 attempts, used to evaluate whether frontier AI agents can match or exceed human R&D capabilities.
Key Result
AI agents score 4x higher than human experts under a 2-hour time budget, but humans show better returns to additional time and narrow the gap at longer budgets, scoring 2x higher than agents at 32 hours.
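To make the time-budget comparison concrete, below is a minimal sketch, not the paper's actual scoring code, of one way to aggregate normalized per-run scores into a best-of-k score at a matched total time budget and compare agents to humans. The run scores, budgets, and the `best_of_k` helper are illustrative assumptions, not values or functions from RE-Bench.

```python
from statistics import mean
from itertools import combinations


def best_of_k(scores, k):
    """Average best-of-k score: mean of the max over every size-k subset of runs.

    Illustrative aggregation, assuming scores are already normalized per
    environment (e.g. 0 = starting solution, 1 = a strong reference solution).
    """
    return mean(max(subset) for subset in combinations(scores, k))


# Hypothetical normalized scores for independent runs on one environment.
agent_2h_runs = [0.3, 0.5, 0.1, 0.4]   # four 2-hour agent runs (8h total)
human_8h_runs = [0.2]                  # one 8-hour human attempt

# Compare at a matched 8-hour total budget: best-of-4 agent runs vs. one human run.
agent_score = best_of_k(agent_2h_runs, k=4)
human_score = best_of_k(human_8h_runs, k=1)
print(f"agent best-of-4 (2h each): {agent_score:.2f}")
print(f"human single 8h attempt:  {human_score:.2f}")
print(f"agent/human ratio:        {agent_score / human_score:.1f}x")
```

Under this kind of matched-budget aggregation, giving agents many short attempts rewards their cheap sampling, while humans benefit more from a single long attempt, which is the qualitative pattern the key result describes.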
Source
- Link: https://arxiv.org/abs/2411.15114
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- autonomy-evals — Evals
- Editorial blurb (verbatim):
  [RE-Bench: Evaluating frontier AI R\&D capabilities of language model agents against human experts](https://arxiv.org/abs/2411.15114)