Measuring AI Ability to Complete Long Tasks

Thomas Kwa, Ben West, Joel Becker, Amy Deng, Katharyn Garcia, Max Hasin, … (+19 more) — 2025-03-19 — METR — METR Blog / arXiv

Summary

Develops a metric for AI agent capability: the length of tasks, measured by how long they take humans to complete, that a model can complete autonomously. On this metric, the task length that frontier models can handle has doubled approximately every 7 months over the past 6 years.

Key Result

The length of tasks that generalist frontier model agents can complete autonomously with 50% reliability has been doubling approximately every 7 months for the last 6 years. At a higher reliability bar, current best models can complete tasks up to a few minutes long.
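
The doubling claim can be made concrete with a little arithmetic. The sketch below assumes a perfectly steady exponential trend with the 7-month doubling time stated above. The function names and the 10-minute starting horizon in the usage note are hypothetical illustrations, not values or code from the paper.

```python
def growth_factor(elapsed_months: float, doubling_months: float = 7.0) -> float:
    """Multiplicative growth after elapsed_months, assuming a constant doubling time."""
    return 2.0 ** (elapsed_months / doubling_months)


def project_horizon(current_minutes: float, months_ahead: float,
                    doubling_months: float = 7.0) -> float:
    """Extrapolate a time horizon forward, assuming the trend simply continues."""
    return current_minutes * growth_factor(months_ahead, doubling_months)


# Six years is 72 months, i.e. a little over ten doublings at 7 months each.
six_year_factor = growth_factor(6 * 12)
print(f"Growth over 6 years: about {six_year_factor:.0f}x")
```

Under these assumptions, six years of doubling every 7 months compounds to growth on the order of a thousandfold. As a usage example, `project_horizon(10, 14)` takes a hypothetical 10-minute horizon forward by two doubling periods and returns 40 minutes. Any such extrapolation depends on the trend continuing, which the summary above does not guarantee.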

Source

Link: https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/
Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- autonomy-evals — Evals
