Measuring AI Ability to Complete Long Tasks

Thomas Kwa, Ben West, Joel Becker, Amy Deng, Katharyn Garcia, Max Hasin, … (+19 more) — 2025-03-19 — METR — METR Blog / arXiv

Summary

Introduces a metric for AI agent capability: the length of tasks, measured by how long they take humans to complete, that a model can complete autonomously. By this measure, the task length that frontier models can handle has been doubling roughly every 7 months over the past 6 years.

Key Result

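The length of tasks that generalist frontier model agents can complete autonomously with 50% reliability has been doubling approximately every 7 months for the last 6 years. At that 50% threshold, the paper reports a time horizon of around 50 minutes for the best models at the time of publication (e.g. Claude 3.7 Sonnet); near-certain success is limited to much shorter tasks, on the order of a few minutes.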
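As a rough illustration of what a 7-month doubling time implies, the following sketch works through the arithmetic of steady exponential growth. It is not the paper's fitting procedure: the function names are hypothetical, and the assumption that the doubling time stays constant is a simplification made only for this example.

```python
import math

# Approximate doubling time quoted in the summary (months).
DOUBLING_MONTHS = 7.0


def growth_factor(months: float, doubling_months: float = DOUBLING_MONTHS) -> float:
    """Multiplicative growth in task length after `months`.

    Assumes steady exponential growth with a fixed doubling time,
    which is an illustrative simplification.
    """
    return 2.0 ** (months / doubling_months)


def months_to_reach(factor: float, doubling_months: float = DOUBLING_MONTHS) -> float:
    """Months needed for task length to grow by `factor` under the same assumption."""
    return doubling_months * math.log2(factor)


if __name__ == "__main__":
    six_years = 6 * 12  # months
    print(f"Doublings in 6 years: {six_years / DOUBLING_MONTHS:.1f}")
    print(f"Growth factor over 6 years: {growth_factor(six_years):.0f}x")
    print(f"Months for a 10x increase: {months_to_reach(10):.1f}")
```

Under these assumptions, six years corresponds to roughly ten doublings, so the task length grows by a factor on the order of a thousand over that period.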

Source