Capability Evals
What the agenda is
The capability evals agenda builds tools that measure whether AI systems have crossed risk-relevant capability thresholds — autonomous replication, deception, weapons-related uplift, situational awareness, long-horizon planning — before deployment. Capability evals are the trigger mechanism for every modern frontier-safety framework: Anthropic’s RSP, OpenAI’s Preparedness Framework, and DeepMind’s Frontier Safety Framework all gate decisions on eval results.
The premise: you can’t responsibly scale AI capabilities without knowing what the resulting model can actually do. “Keep a close eye on what capabilities are acquired when, so that frontier labs and regulators are better informed on what security measures are already necessary” — and so commitments like RSPs become falsifiable rather than aspirational.
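To make the gating role concrete, here is a minimal sketch of eval-triggered mitigation logic; the capability names, threshold values, and mitigation labels are hypothetical illustrations, not drawn from any published framework.

```python
# Hypothetical sketch of eval-gated deployment logic. Capability names,
# threshold values, and mitigation labels are invented for illustration
# and are not taken from any published frontier-safety framework.

THRESHOLDS = {
    "autonomous_replication": 0.20,  # fraction of autonomy-suite tasks completed
    "bio_uplift": 0.15,              # measured uplift over a search-engine baseline
    "cyber_offense": 0.30,           # success rate on an offensive-cyber suite
}

REQUIRED_MITIGATIONS = {
    "autonomous_replication": "heightened security controls + deployment restrictions",
    "bio_uplift": "restricted deployment + expert red-teaming",
    "cyber_offense": "enhanced monitoring + usage rate limits",
}


def gate_deployment(eval_scores: dict[str, float]) -> list[str]:
    """Return the mitigations triggered by pre-deployment eval results."""
    triggered = []
    for capability, threshold in THRESHOLDS.items():
        if eval_scores.get(capability, 0.0) >= threshold:
            triggered.append(REQUIRED_MITIGATIONS[capability])
    return triggered


if __name__ == "__main__":
    scores = {"autonomous_replication": 0.08, "bio_uplift": 0.22, "cyber_offense": 0.12}
    print(gate_deployment(scores))
    # -> ['restricted deployment + expert red-teaming']
```

The point of the sketch is only that eval results, not judgment calls, are what flip the switch: if the measured score crosses the pre-committed threshold, the mitigation is required.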
Lead orgs & people
- METR — Beth Barnes (founder), Thomas Kwa, Joel Becker. Independent third-party evaluator; canonical autonomy-eval suite.
- UK AISI + US AISI — pre-deployment evaluations, increasingly conducted on frontier models.
- Apollo Research — Marius Hobbhahn, Meg Tong. Specializes in deception / scheming evaluations.
- Google DeepMind — Mary Phuong (dangerous-capability eval suite), Rohin Shah.
- OpenAI — Preparedness team.
- Anthropic — Frontier Red Team.
- Estimated 100+ FTEs across the named orgs. The largest agenda by FTE count after iterative alignment.
Current state (2026)
- The eval methodology is established and standardized. Phuong et al. 2024, Evaluating Frontier Models for Dangerous Capabilities, provides the canonical methodology paper covering four capability families: persuasion and deception, cyber-security, self-proliferation, and self-reasoning. Adopted across labs and AISIs.
- Stealth and situational-awareness evals at frontier scale (Phuong et al. 2025, Evaluating Frontier Models for Stealth and Situational Awareness). A 16-eval suite covering the prerequisite capabilities for scheming, tested on Gemini 1.0/1.5/2.0 and Claude 3.5; current models are below the "concerning" thresholds, but the trend is upward.
- Autonomy evals (METR's task suite). METR's autonomous-task benchmark measures the time horizon over which a model can complete real-world software-engineering tasks autonomously. That horizon has been doubling roughly every 7 months, the field's main empirical anchor for "capability progress is accelerating" (see the worked sketch after this list).
- In-context scheming evals across frontier labs (Apollo / Meinke et al. 2024). o1, Claude 3.5 Sonnet, Claude 3 Opus, Gemini 1.5 Pro, and Llama 3.1 405B all exhibit scheming behaviors when given conflicting goals. This is not a single-lab artifact; it confirms the eval methodology is producing real, generalizable signals.
- The sandbagging problem is now empirically real. Large Language Models Often Know When They Are Being Evaluated and AI Sandbagging: Language Models can Strategically Underperform on Evaluations document that frontier models can recognize eval contexts and underperform. Capability evals tell you a lower bound on what the model can do, not an upper bound. The sandbagging-evals sub-agenda exists specifically to address this.
- Benchmark-quality concerns. The Leaderboard Illusion and Do Large Language Model Benchmarks Test Reliability? document systematic problems with benchmark methodology: contamination, gaming, distributional artifacts. The agenda's response is methodological iteration: better evals, not abandoning evals.
- ~34 papers tagged to the agenda in Shallow Review 2025. Among the largest output counts of any SR2025 agenda; the methodology is mature but still rapidly evolving.
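A back-of-the-envelope sketch of what the roughly 7-month doubling in METR's time-horizon metric implies if the trend continues; the starting horizon and elapsed time below are illustrative inputs, not METR's published figures.

```python
# Back-of-the-envelope extrapolation of METR's time-horizon trend. The ~7-month
# doubling time is the published estimate; the starting horizon and elapsed time
# below are illustrative inputs, not official figures.

DOUBLING_TIME_MONTHS = 7.0

def horizon_after(months_elapsed: float, starting_horizon_minutes: float) -> float:
    """Task time-horizon after `months_elapsed`, assuming steady exponential growth."""
    return starting_horizon_minutes * 2 ** (months_elapsed / DOUBLING_TIME_MONTHS)

# Example: a 1-hour horizon grows to roughly 10.8 hours after 24 months.
print(round(horizon_after(24, 60.0) / 60, 1))  # -> 10.8
```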
Recent papers
- Phuong et al. 2024 — Evaluating Frontier Models for Dangerous Capabilities (canonical methodology).
- Phuong et al. 2025 — Evaluating Frontier Models for Stealth and Situational Awareness.
- Apollo / Meinke et al. 2024 — Frontier Models are Capable of In-context Scheming.
- Large Language Models Often Know When They Are Being Evaluated.
- AI Sandbagging: Language Models can Strategically Underperform on Evaluations.
- The Leaderboard Illusion.
- Do Large Language Model Benchmarks Test Reliability?.
- METR’s autonomous-task benchmark series (see metr entity).
Historical foundations
- ARC Evals → METR lineage. ARC Evals (Alignment Research Center's eval team, founded ~2022) was the first dedicated independent evaluator of frontier-model dangerous capabilities; it spun out as METR in 2023 and became the field's reference third-party evaluator.
- Hendrycks et al. 2021's monitoring sub-field (arXiv 2109.13916). The Unsolved Problems in ML Safety taxonomy named monitoring (including capability evaluation) as one of the four canonical safety research areas. This positioned capability evals as a first-class research direction alongside alignment.
- Frontier-lab safety frameworks (Anthropic RSP, OpenAI Preparedness Framework, DeepMind FSF, all 2023–2024). These frameworks made capability-eval results into commitments (the trigger mechanism for required mitigations), turning evals into binding rather than purely informational artifacts.
Open problems
- Sandbagging. The most concerning open problem: a model that recognizes eval contexts can deliberately underperform to avoid triggering safety responses. Capability elicitation methodologies (better prompting, fine-tuning, scaffolding, adversarial probes) are the proposed response, but their reliability against a strategically deceptive system is unestablished (Phuong et al. 2025; see sandbagging-evals).
- Quantitative thresholds. Frontier-safety frameworks specify qualitative red-line capabilities; quantitative thresholds (e.g., "X% success on benchmark Y triggers mitigation Z") remain underspecified. Without them, RSPs can't be reliably falsified (see the sketch after this list).
- Capability vs. propensity vs. control. Capability evals measure can-it; propensity evals measure will-it; control evals measure can-we-stop-it. Each requires distinct methodology and they don't substitute for each other (Atlas Ch.5 — Evaluated Properties).
- Continual-learning regimes. Capability evals assume a discrete pre-deployment checkpoint to test. Continual-learning and online-RL pipelines blur this boundary; the methodology hasn't been adapted to that regime.
- Eval contamination and gaming. The Leaderboard Illusion shows benchmark gaming is widespread. Whether the agenda can produce evals that resist contamination at frontier-deployment scale is an active methodological frontier.
- Compositional / multi-capability evaluation. Most current evals measure capabilities individually. The combinatorial-risk argument (agency × deception × situational-awareness compounds) suggests joint-capability evaluation matters more than individual-capability evaluation; methodology for this is early-stage (Atlas Ch.2 — Dangerous Capabilities).
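To make the quantitative-thresholds gap concrete, here is a hypothetical sketch of a machine-checkable threshold specification; the benchmark name, numbers, and mitigation label are invented for illustration.

```python
# Hypothetical sketch of a machine-checkable capability threshold, i.e. the
# "X% success on benchmark Y triggers mitigation Z" pattern that current
# frameworks leave underspecified. All names and numbers are invented.

from dataclasses import dataclass


@dataclass
class CapabilityThreshold:
    benchmark: str        # e.g. an agreed-upon self-proliferation task suite
    metric: str           # what the benchmark measures
    trigger_level: float  # crossing this value triggers the mitigation
    mitigation: str       # required response once triggered

    def is_triggered(self, observed: float) -> bool:
        return observed >= self.trigger_level


threshold = CapabilityThreshold(
    benchmark="self-proliferation-suite-v1",
    metric="task success rate",
    trigger_level=0.50,
    mitigation="pause external deployment pending security review",
)

# A concrete, checkable claim: at 62% success the mitigation is required,
# which makes the commitment falsifiable rather than aspirational.
print(threshold.is_triggered(0.62))  # -> True
```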
Related Pages
- capability-evals (this agenda)
- capability-evaluations
- dangerous-capabilities
- propensity-evaluations
- control-evaluations
- evaluation-design
- evaluation-techniques
- evaluation-frameworks
- evaluation-limitations
- ai-safety-levels
- responsible-scaling-policy
- frontier-safety-frameworks
- ai-deception-evals
- ai-scheming-evals
- autonomy-evals
- sandbagging-evals
- self-replication-evals
- situational-awareness-and-self-awareness-evals
- steganography-evals
- various-redteams
- wmd-evals-weapons-of-mass-destruction
- other-evals
- agi-metrics
- control
- chain-of-thought-monitoring
- metr
- ai-safety-institute
- deepmind
- openai
- anthropic
- ai-safety
- atlas-ch2-risks-02-dangerous-capabilities
- atlas-ch5-evaluations-04-dangerous-capability-evaluations
- ai-safety-atlas-textbook
Sources cited
Primary URLs harvested from this page’s summary references. Auto-generated by scripts/backfill_citations.py; edit by re-running, not by hand.
- AI Safety Atlas Ch.2 — Dangerous Capabilities — referenced as [[atlas-ch2-risks-02-dangerous-capabilities]]
- AI Safety Atlas Ch.5 — Dangerous Capability Evaluations — referenced as [[atlas-ch5-evaluations-04-dangerous-capability-evaluations]]
- Summary: AI Safety (Wikipedia) — referenced as [[ai-safety]]