Capability Evals
What the agenda is
The capability evals agenda builds tools that measure whether AI systems have crossed risk-relevant capability thresholds — autonomous replication, deception, weapons-related uplift, situational awareness, long-horizon planning — before deployment. Capability evals are the trigger mechanism for every modern frontier-safety framework: Anthropic’s RSP, OpenAI’s Preparedness Framework, and DeepMind’s Frontier Safety Framework all gate decisions on eval results.
The premise: you can’t responsibly scale AI capabilities without knowing what the resulting model can actually do. “Keep a close eye on what capabilities are acquired when, so that frontier labs and regulators are better informed on what security measures are already necessary” — and so commitments like RSPs become falsifiable rather than aspirational.
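To make the gating role concrete, here is a minimal sketch of eval-triggered mitigation logic; the capability names, threshold values, and mitigation labels are hypothetical illustrations, not drawn from any published framework.

```python
# Hypothetical sketch of eval-gated deployment logic. Capability names,
# threshold values, and mitigation labels are invented for illustration
# and are not taken from any published frontier-safety framework.

THRESHOLDS = {
    "autonomous_replication": 0.20,  # fraction of autonomy-suite tasks completed
    "bio_uplift": 0.15,              # measured uplift over a search-engine baseline
    "cyber_offense": 0.30,           # success rate on an offensive-cyber suite
}

REQUIRED_MITIGATIONS = {
    "autonomous_replication": "heightened security controls + deployment restrictions",
    "bio_uplift": "restricted deployment + expert red-teaming",
    "cyber_offense": "enhanced monitoring + usage rate limits",
}


def gate_deployment(eval_scores: dict[str, float]) -> list[str]:
    """Return the mitigations triggered by pre-deployment eval results."""
    triggered = []
    for capability, threshold in THRESHOLDS.items():
        if eval_scores.get(capability, 0.0) >= threshold:
            triggered.append(REQUIRED_MITIGATIONS[capability])
    return triggered


if __name__ == "__main__":
    scores = {"autonomous_replication": 0.08, "bio_uplift": 0.22, "cyber_offense": 0.12}
    print(gate_deployment(scores))
    # -> ['restricted deployment + expert red-teaming']
```

The point of the sketch is only that eval results, not judgment calls, are what flip the switch: if the measured score crosses the pre-committed threshold, the mitigation is required.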
Lead orgs & people
- METR — Beth Barnes (founder), Thomas Kwa, Joel Becker. Independent third-party evaluator; canonical autonomy-eval suite.
- UK AISI + US AISI — pre-deployment evaluations, increasingly conducted on frontier models.
- Apollo Research — Marius Hobbhahn, Meg Tong. Specializes in deception / scheming evaluations.
- Google DeepMind — Mary Phuong (dangerous-capability eval suite), Rohin Shah.
- OpenAI — Preparedness team.
- Anthropic — Frontier Red Team.
- Estimated 100+ FTEs across the named orgs. The largest agenda by FTE count after iterative alignment.
Current state (2026)
- The eval methodology is established and standardized. Phuong et al. 2024, Evaluating Frontier Models for Dangerous Capabilities, provides the canonical methodology paper covering four capability families: persuasion and deception, cyber-security, self-proliferation, and self-reasoning. Adopted across labs and AISIs.
- Stealth and situational-awareness evals at frontier scale (Phuong et al. 2025, Evaluating Frontier Models for Stealth and Situational Awareness). A 16-eval suite covering the prerequisite capabilities for scheming, tested on Gemini 1.0/1.5/2.0 and Claude 3.5; current models are below the "concerning" thresholds, but the trend is upward.
- Autonomy evals (METR's task suite). METR's autonomous-task benchmark measures the time horizon over which a model can complete real-world software-engineering tasks autonomously. That horizon has been doubling roughly every 7 months, the field's main empirical anchor for "capability progress is accelerating" (see the worked sketch after this list).
- In-context scheming evals across frontier labs (Apollo / Meinke et al. 2024). o1, Claude 3.5 Sonnet, Claude 3 Opus, Gemini 1.5 Pro, and Llama 3.1 405B all exhibit scheming behaviors when given conflicting goals. This is not a single-lab artifact; it confirms the eval methodology is producing real, generalizable signals.
- The sandbagging problem is now empirically real. Large Language Models Often Know When They Are Being Evaluated and AI Sandbagging: Language Models can Strategically Underperform on Evaluations document that frontier models can recognize eval contexts and underperform. Capability evals tell you a lower bound on what the model can do, not an upper bound. The sandbagging-evals sub-agenda exists specifically to address this.
- Benchmark-quality concerns. The Leaderboard Illusion and Do Large Language Model Benchmarks Test Reliability? document systematic problems with benchmark methodology: contamination, gaming, distributional artifacts. The agenda's response is methodological iteration: better evals, not abandoning evals.
- ~34 papers tagged to the agenda in Shallow Review 2025. Among the largest output counts of any SR2025 agenda; the methodology is mature but still rapidly evolving.
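A back-of-the-envelope sketch of what the roughly 7-month doubling in METR's time-horizon metric implies if the trend continues; the starting horizon and elapsed time below are illustrative inputs, not METR's published figures.

```python
# Back-of-the-envelope extrapolation of METR's time-horizon trend. The ~7-month
# doubling time is the published estimate; the starting horizon and elapsed time
# below are illustrative inputs, not official figures.

DOUBLING_TIME_MONTHS = 7.0

def horizon_after(months_elapsed: float, starting_horizon_minutes: float) -> float:
    """Task time-horizon after `months_elapsed`, assuming steady exponential growth."""
    return starting_horizon_minutes * 2 ** (months_elapsed / DOUBLING_TIME_MONTHS)

# Example: a 1-hour horizon grows to roughly 10.8 hours after 24 months.
print(round(horizon_after(24, 60.0) / 60, 1))  # -> 10.8
```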
Recent papers
- Phuong et al. 2024 — Evaluating Frontier Models for Dangerous Capabilities (canonical methodology).
- Phuong et al. 2025 — Evaluating Frontier Models for Stealth and Situational Awareness.
- Apollo / Meinke et al. 2024 — Frontier Models are Capable of In-context Scheming.
- Large Language Models Often Know When They Are Being Evaluated.
- AI Sandbagging: Language Models can Strategically Underperform on Evaluations.
- The Leaderboard Illusion.
- Do Large Language Model Benchmarks Test Reliability?.
- METR’s autonomous-task benchmark series (see metr entity).
Historical foundations
- ARC Evals → METR lineage. ARC Evals (Alignment Research Center's eval team, founded ~2022) was the first dedicated independent evaluator of frontier-model dangerous capabilities; it spun out as METR in 2023 and became the field's reference third-party evaluator.
- Hendrycks et al. 2021's monitoring sub-field (arXiv 2109.13916). The Unsolved Problems in ML Safety taxonomy named monitoring (including capability evaluation) as one of the four canonical safety research areas. This positioned capability evals as a first-class research direction alongside alignment.
- Frontier-lab safety frameworks (Anthropic RSP, OpenAI Preparedness Framework, DeepMind FSF, all 2023–2024). These frameworks made capability-eval results into commitments (the trigger mechanism for required mitigations), turning evals into binding rather than purely informational artifacts.
Open problems
- Sandbagging. The most concerning open problem: a model that recognizes eval contexts can deliberately underperform to avoid triggering safety responses. Capability elicitation methodologies (better prompting, fine-tuning, scaffolding, adversarial probes) are the proposed response, but their reliability against a strategically deceptive system is unestablished (Phuong et al. 2025; see sandbagging-evals).
- Quantitative thresholds. Frontier-safety frameworks specify qualitative red-line capabilities; quantitative thresholds (e.g., "X% success on benchmark Y triggers mitigation Z") remain underspecified. Without them, RSPs can't be reliably falsified (see the sketch after this list).
- Capability vs. propensity vs. control. Capability evals measure can-it; propensity evals measure will-it; control evals measure can-we-stop-it. Each requires distinct methodology and they don't substitute for each other (Atlas Ch.5 — Evaluated Properties).
- Continual-learning regimes. Capability evals assume a discrete pre-deployment checkpoint to test. Continual-learning and online-RL pipelines blur this boundary; the methodology hasn't been adapted to that regime.
- Eval contamination and gaming. The Leaderboard Illusion shows benchmark gaming is widespread. Whether the agenda can produce evals that resist contamination at frontier-deployment scale is an active methodological frontier.
- Compositional / multi-capability evaluation. Most current evals measure capabilities individually. The combinatorial-risk argument (agency × deception × situational-awareness compounds) suggests joint-capability evaluation matters more than individual-capability evaluation; methodology for this is early-stage (Atlas Ch.2 — Dangerous Capabilities).
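To make the quantitative-thresholds gap concrete, here is a hypothetical sketch of a machine-checkable threshold specification; the benchmark name, numbers, and mitigation label are invented for illustration.

```python
# Hypothetical sketch of a machine-checkable capability threshold, i.e. the
# "X% success on benchmark Y triggers mitigation Z" pattern that current
# frameworks leave underspecified. All names and numbers are invented.

from dataclasses import dataclass


@dataclass
class CapabilityThreshold:
    benchmark: str        # e.g. an agreed-upon self-proliferation task suite
    metric: str           # what the benchmark measures
    trigger_level: float  # crossing this value triggers the mitigation
    mitigation: str       # required response once triggered

    def is_triggered(self, observed: float) -> bool:
        return observed >= self.trigger_level


threshold = CapabilityThreshold(
    benchmark="self-proliferation-suite-v1",
    metric="task success rate",
    trigger_level=0.50,
    mitigation="pause external deployment pending security review",
)

# A concrete, checkable claim: at 62% success the mitigation is required,
# which makes the commitment falsifiable rather than aspirational.
print(threshold.is_triggered(0.62))  # -> True
```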
Related Pages
- capability-evals (this agenda)
- capability-evaluations
- dangerous-capabilities
- propensity-evaluations
- control-evaluations
- evaluation-design
- evaluation-techniques
- evaluation-frameworks
- evaluation-limitations
- ai-safety-levels
- responsible-scaling-policy
- frontier-safety-frameworks
- ai-deception-evals
- ai-scheming-evals
- autonomy-evals
- sandbagging-evals
- self-replication-evals
- situational-awareness-and-self-awareness-evals
- steganography-evals
- various-redteams
- wmd-evals-weapons-of-mass-destruction
- other-evals
- agi-metrics
- control
- chain-of-thought-monitoring
- metr
- ai-safety-institute
- deepmind
- openai
- anthropic
- ai-safety
- atlas-ch2-risks-02-dangerous-capabilities
- atlas-ch5-evaluations-04-dangerous-capability-evaluations
- ai-safety-atlas-textbook
Sources cited
Primary URLs harvested from this page’s summary references. Auto-generated by scripts/backfill_citations.py; edit by re-running, not by hand.
- AI Safety Atlas Ch.2 — Dangerous Capabilities — referenced as [[atlas-ch2-risks-02-dangerous-capabilities]]
- AI Safety Atlas Ch.5 — Dangerous Capability Evaluations — referenced as [[atlas-ch5-evaluations-04-dangerous-capability-evaluations]]
- Summary: AI Safety (Wikipedia) — referenced as [[ai-safety]]