Capability Evals

What the agenda is

The capability evals agenda builds tools that measure whether AI systems have crossed risk-relevant capability thresholds — autonomous replication, deception, weapons-related uplift, situational awareness, long-horizon planning — before deployment. Capability evals are the trigger mechanism for every modern frontier-safety framework: Anthropic’s RSP, OpenAI’s Preparedness Framework, and DeepMind’s Frontier Safety Framework all gate decisions on eval results.

The premise: you can’t responsibly scale AI capabilities without knowing what the resulting model can actually do. “Keep a close eye on what capabilities are acquired when, so that frontier labs and regulators are better informed on what security measures are already necessary” — and so commitments like RSPs become falsifiable rather than aspirational.
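
As a toy illustration of that gating pattern, the sketch below runs a model against a small battery of eval suites and maps each score onto a required mitigation. The suite names, thresholds, and `run_eval` interface are hypothetical placeholders, not any lab's actual policy; the point is only the shape of a falsifiable trigger.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical gates: "a score at or above the trigger on this eval suite
# requires this mitigation". Real frameworks mostly state these qualitatively,
# which is the falsifiability gap noted under Open problems below.
@dataclass
class CapabilityGate:
    eval_suite: str           # e.g. an autonomy or weapons-uplift task suite
    trigger_score: float      # fraction of tasks solved that trips the gate
    required_mitigation: str

GATES = [
    CapabilityGate("autonomous_replication_suite", 0.50, "enhanced security review"),
    CapabilityGate("weapons_uplift_suite",         0.20, "halt deployment pending safeguards"),
    CapabilityGate("situational_awareness_suite",  0.60, "expanded control evaluations"),
]

def gate_deployment(run_eval: Callable[[str], float]) -> list[str]:
    """Run each eval suite and collect the mitigations its score triggers.

    `run_eval` is an assumed interface: given a suite name, it returns the
    model's success rate on that suite under best-effort elicitation.
    """
    triggered = []
    for gate in GATES:
        score = run_eval(gate.eval_suite)
        if score >= gate.trigger_score:
            triggered.append(
                f"{gate.eval_suite}: {score:.0%} >= {gate.trigger_score:.0%}"
                f" -> {gate.required_mitigation}"
            )
    return triggered
```

A numeric trigger of this kind is what makes a commitment checkable after the fact; a qualitative red line leaves the judgment call to the lab.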

Lead orgs & people

  • METR — Beth Barnes (founder), Thomas Kwa, Joel Becker. Independent third-party evaluator; canonical autonomy-eval suite.
  • UK AISI + US AISI — government evaluation institutes; increasingly conduct pre-deployment evaluations of frontier models.
  • Apollo Research — Marius Hobbhahn, Meg Tong. Specializes in deception / scheming evaluations.
  • Google DeepMind — Mary Phuong (dangerous-capability eval suite), Rohin Shah.
  • OpenAI — Preparedness team.
  • Anthropic — Frontier Red Team.
  • Estimated 100+ FTEs across the named orgs. The largest agenda by FTE count after iterative alignment.

Current state (2026)

Recent papers

Historical foundations

  • ARC Evals → METR lineage. ARC Evals (Alignment Research Center’s eval team, founded ~2022) was the first dedicated independent evaluator of frontier-model dangerous capabilities; it spun out as METR in 2023 and became the field’s reference third-party evaluator.

  • Hendrycks et al. 2021’s monitoring sub-field (arXiv 2109.13916). The Unsolved Problems in ML Safety taxonomy named monitoring (including capability evaluation) as one of the four canonical safety research areas. This positioned capability evals as a first-class research direction alongside alignment.

  • Frontier-lab safety frameworks (Anthropic RSP, OpenAI Preparedness Framework, DeepMind FSF, all 2023–2024). These frameworks made capability-eval results into commitments — the trigger mechanism for required mitigations — turning evals into binding rather than purely informational artifacts.

Open problems

  • Sandbagging. The most concerning open problem: a model that recognizes eval contexts can deliberately underperform to avoid triggering safety responses. Capability elicitation methodologies (better prompting, fine-tuning, scaffolding, adversarial probes) are the proposed response (a minimal sketch follows this list), but their reliability against a strategically deceptive system is unestablished (Phuong et al. 2025; see sandbagging-evals).

  • Quantitative thresholds. Frontier-safety frameworks specify qualitative red-line capabilities; quantitative thresholds (e.g., “X% success on benchmark Y triggers mitigation Z”) remain underspecified. Without them, RSPs can’t be reliably falsified.

  • Capability vs. propensity vs. control. Capability evals measure whether a model can do something dangerous; propensity evals measure whether it would; control evals measure whether we could stop it. Each requires distinct methodology, and they don’t substitute for one another (Atlas Ch.5 — Evaluated Properties); a toy schema after this list makes the distinction explicit.

  • Continual-learning regimes. Capability evals assume a discrete pre-deployment checkpoint to test. Continual-learning and online-RL pipelines blur this boundary; the methodology hasn’t been adapted to that regime.

  • Eval contamination and gaming. The Leaderboard Illusion shows benchmark gaming is widespread. Whether the agenda can produce evals that resist contamination at frontier-deployment scale is an active methodological frontier.

  • Compositional / multi-capability evaluation. Most current evals measure capabilities individually. The combinatorial-risk argument (risk compounds when agency, deception, and situational awareness co-occur) suggests joint-capability evaluation matters more than individual-capability evaluation; methodology for this is early-stage (Atlas Ch.2 — Dangerous Capabilities).
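
The elicitation response to sandbagging, referenced in the first item above, amounts to scoring the model under several elicitation conditions and treating the best result as the capability estimate. A minimal sketch, assuming a hypothetical `score_under` interface and illustrative condition names; as noted above, max-over-conditions is not known to be reliable against a strategically deceptive model.

```python
from typing import Callable, Iterable

# Illustrative elicitation conditions; real methodologies also include
# fine-tuning and adversarial probing, which don't reduce to a single setup name.
ELICITATION_CONDITIONS = [
    "zero_shot",
    "best_known_prompt",
    "agentic_scaffold",
    "tool_augmented_scaffold",
]

def elicited_capability(
    score_under: Callable[[str], float],
    conditions: Iterable[str] = ELICITATION_CONDITIONS,
) -> tuple[float, dict[str, float]]:
    """Estimate capability as the best score across elicitation conditions.

    `score_under(condition)` is an assumed interface returning the model's
    success rate on the eval when run under that condition. Taking the max
    guards against weak performance in any single setup, but not against a
    model that recognizes every eval context and underperforms uniformly.
    """
    per_condition = {c: score_under(c) for c in conditions}
    return max(per_condition.values()), per_condition
```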
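
And to make the capability / propensity / control distinction concrete, here is a toy schema (the names are illustrative, not a standard API) recording the question each eval type answers; the point is that evidence of one type says nothing about the others.

```python
from enum import Enum

class DangerEvalType(Enum):
    # Can the model do the dangerous thing at all, under best-effort elicitation?
    CAPABILITY = "can it"
    # Would it actually do so unprompted, under realistic deployment conditions?
    PROPENSITY = "will it"
    # If it tried, would monitoring and safeguards catch and stop it?
    CONTROL = "can we stop it"

def evidence_gap(evidence: set[DangerEvalType]) -> set[DangerEvalType]:
    """Return the eval types a safety case is still missing evidence for."""
    return set(DangerEvalType) - evidence
```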
