Forecasting Frontier Language Model Agent Capabilities
Govind Pimpale, Axel Højmark, Jérémy Scheurer, Marius Hobbhahn — 2025-02-21 — arXiv
Summary
Develops and validates six statistical forecasting methods to predict future language model agent performance on safety-relevant benchmarks, using backtesting on 38 models and making concrete predictions for 2026 capabilities on SWE-Bench Verified, Cybench, and RE-Bench.
Key Result
The paper forecasts that by early 2026, state-of-the-art LM agents will reach an 87% success rate on SWE-Bench Verified (software development), while non-specialized agents will reach 54%.
Source
- Link: https://arxiv.org/abs/2502.15850
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- autonomy-evals — Evals