Forecasting Frontier Language Model Agent Capabilities

Govind Pimpale, Axel Højmark, Jérémy Scheurer, Marius Hobbhahn — 2025-02-21 — arXiv

Summary

Develops and validates six statistical forecasting methods to predict future language model agent performance on safety-relevant benchmarks, using backtesting on 38 models and making concrete predictions for 2026 capabilities on SWE-Bench Verified, Cybench, and RE-Bench.

Key Result

The forecasts predict that by early 2026, state-of-the-art LM agents will reach an 87% success rate on SWE-Bench Verified (software development), while non-specialized agents will reach 54%.

Source

Link: https://arxiv.org/abs/2502.15850
Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- autonomy-evals — Evals
