Forecasting Frontier Language Model Agent Capabilities

Govind Pimpale, Axel Højmark, Jérémy Scheurer, Marius Hobbhahn — 2025-02-21 — arXiv

Summary

Develops six statistical forecasting methods for predicting future language model agent performance on safety-relevant benchmarks, validates them by backtesting on 38 models, and makes concrete predictions for 2026 capabilities on SWE-Bench Verified, Cybench, and RE-Bench.

Key Result

The forecasts indicate that by early 2026, state-of-the-art LM agents will reach an 87% success rate on SWE-Bench Verified (software development), while non-specialized agents will reach 54%.

Source