Measuring AI Ability to Complete Long Tasks
Thomas Kwa, Ben West, Joel Becker, Amy Deng, Katharyn Garcia, Max Hasin, … (+19 more) — 2025-03-19 — METR — METR Blog / arXiv
Summary
Develops a novel metric for measuring AI agent capabilities based on the length of tasks (measured by human completion time) that models can complete autonomously, finding that the task length frontier models can handle has been doubling roughly every 7 months over the past 6 years.
Key Result
The length of tasks that generalist frontier model agents can complete autonomously with 50% reliability (their 50% time horizon) has been doubling approximately every 7 months for the last 6 years. At much higher reliability, the current best models complete only tasks up to a few minutes long.
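To make the doubling claim concrete, the arithmetic can be sketched as below. This is a minimal illustration that assumes a clean exponential trend using only the two figures stated above (7-month doubling time, 6-year span); the function names and the 1-minute and 8-minute inputs in the usage note are hypothetical and are not taken from the paper.

```python
import math

DOUBLING_MONTHS = 7.0  # doubling time stated in the Key Result
SPAN_YEARS = 6.0       # period the trend is reported over


def growth_factor(elapsed_months: float,
                  doubling_months: float = DOUBLING_MONTHS) -> float:
    """Multiplicative growth in task length after elapsed_months,
    assuming a pure exponential with the given doubling time."""
    return 2.0 ** (elapsed_months / doubling_months)


def months_to_reach(current_minutes: float, target_minutes: float,
                    doubling_months: float = DOUBLING_MONTHS) -> float:
    """Months needed to grow from current_minutes to target_minutes
    under the same assumed exponential trend."""
    return doubling_months * math.log2(target_minutes / current_minutes)


if __name__ == "__main__":
    span_months = SPAN_YEARS * 12
    doublings = span_months / DOUBLING_MONTHS
    print(f"{doublings:.1f} doublings over {SPAN_YEARS:.0f} years")
    print(f"growth factor: {growth_factor(span_months):.0f}x")
```

Under these assumptions, six years holds a little over ten doublings, which compounds to a growth factor on the order of a thousand. As a hypothetical usage example, `months_to_reach(1, 8)` gives 21 months, since an 8-fold increase is three doublings.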
Source
- Link: https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- autonomy-evals — Evals