Capability Evaluations
Definition
Capability evaluations are systematic tests designed to assess whether an AI system has reached a specified threshold on a risk-relevant capability — autonomous replication, weapons-related uplift, deception, long-horizon planning, situational awareness, ability to evade oversight — before the model is trained further or deployed (Phuong et al. 2024, Evaluating Frontier Models for Dangerous Capabilities; Atlas Ch.5 — Dangerous Capability Evaluations).
They are distinct from standard performance benchmarks: a benchmark measures how good a model is at a task; a capability eval measures whether the model has crossed a risk threshold on a task whose mastery would enable catastrophic harm (Atlas Ch.5 — Evaluated Properties; see atlas-ch5-evaluations-03-evaluated-properties).
Why it matters
Capability evals are the trigger mechanism in every modern frontier-safety framework. The major frontier labs all use the same if-then logic: if the model reaches capability threshold X, then mitigation Y is required — and the trigger is the eval result (Anthropic’s Responsible Scaling Policy; OpenAI’s Preparedness Framework; DeepMind’s Frontier Safety Framework).
Without reliable evals, RSPs and similar commitments are unfalsifiable: there’s no way to know whether the lab has crossed its own threshold. With reliable evals, those commitments become auditable — and external evaluators (UK AISI, US AISI, METR) can independently check them (80,000 Hours: Nick Joseph on Anthropic’s Safety Approach; Atlas Ch.5 — Limitations).
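As a minimal sketch of that if-then pattern (threshold values, capability names, and the mitigation strings here are invented for illustration, not taken from any lab's actual framework):

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    """One capability eval's outcome, with score normalized to [0, 1]."""
    capability: str
    score: float

# Hypothetical thresholds; real frameworks define these per capability
# family and per safety level, often qualitatively rather than as numbers.
THRESHOLDS = {
    "autonomous-replication": 0.50,
    "cyber-uplift": 0.40,
}

def triggered_mitigations(results: list[EvalResult]) -> list[str]:
    """If the model reaches capability threshold X, mitigation Y is required."""
    return [
        f"mitigation required before further training/deployment: {r.capability}"
        for r in results
        if r.score >= THRESHOLDS.get(r.capability, float("inf"))
    ]

print(triggered_mitigations([EvalResult("cyber-uplift", 0.55)]))
# -> ['mitigation required before further training/deployment: cyber-uplift']
```

The eval result is the trigger: everything downstream in the framework keys off whether the score clears the threshold.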
Key results
- The DeepMind dangerous-capabilities eval suite (Phuong et al. 2024) tests four risk-relevant capability families: persuasion-and-deception, cyber-security, self-proliferation, and self-reasoning. Frontier models (Gemini 1.0 Pro/Ultra) showed early indicators on persuasion/deception and self-reasoning but had not yet crossed any “dangerous” threshold. This is the canonical methodology paper for modern dangerous-capability evals.
- The red-line / yellow-line distinction (Anthropic RSP; 80,000 Hours: Nick Joseph). Red-line capabilities are actually dangerous thresholds the lab commits not to cross without specific mitigations in place. Yellow-line evals are early-warning tests that flag when a model is approaching but has not yet crossed the red line — providing time to prepare (see the first sketch after this list).
- Frontier-lab convergence on the eval-based safety pattern. All three of Anthropic’s RSP, OpenAI’s Preparedness Framework, and DeepMind’s Frontier Safety Framework structure their commitments around capability-eval thresholds (Atlas Ch.5 — Evaluation Frameworks; see atlas-ch5-evaluations-07-evaluation-frameworks). National AISIs (UK, US, EU AI Office) have adopted the same vocabulary and increasingly conduct independent evals.
- Behavioral evals alone cannot rule out scheming/deceptive-alignment (Atlas Ch.5 — Limitations; see atlas-ch5-evaluations-09-limitations). A strategically aware model can sandbag — deliberately underperform — on capability evals to avoid triggering mitigations. This is the central theoretical limitation: capability evals tell you a lower bound on what the model can do, not an upper bound.
- Capability ≠ propensity ≠ control. Atlas Ch.5 separates three orthogonal questions: can the model do X (capability), will the model do X if given the chance (propensity), and can we stop the model from doing X (control). Each requires distinct eval methodologies; capability evals address only the first (Atlas Ch.5.3; see also propensity-evaluations and control-evaluations). The second sketch after this list makes the split concrete.
- Sustained-vs.-isolated capability matters. A model that occasionally produces dangerous outputs is far less concerning than one that can reliably chain capabilities across multi-step plans. Evals must test consistency and chaining, not just point capability (80,000 Hours: Nick Joseph; Phuong et al. 2024); the first sketch below scores full-chain success for exactly this reason.
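As a rough sketch of how the yellow-line/red-line logic and the chained-consistency point combine in an eval harness (all numbers and the toy task are invented):

```python
import random

YELLOW_LINE = 0.20   # early warning: approaching the red line, start preparing
RED_LINE = 0.50      # commitment threshold: mitigations required to proceed
CHAIN_LENGTH = 5     # a multi-step plan; every step must succeed
N_TRIALS = 200

def attempt_step(per_step_rate: float) -> bool:
    """Stand-in for one step of a dangerous-capability task."""
    return random.random() < per_step_rate

def chained_success_rate(per_step_rate: float) -> float:
    """Fraction of trials in which the model completes the whole chain.
    This tests sustained capability, not isolated point capability."""
    ok = sum(
        all(attempt_step(per_step_rate) for _ in range(CHAIN_LENGTH))
        for _ in range(N_TRIALS)
    )
    return ok / N_TRIALS

random.seed(0)
rate = chained_success_rate(0.70)  # 70% per-step success looks alarming...
status = ("RED" if rate >= RED_LINE
          else "YELLOW" if rate >= YELLOW_LINE
          else "below yellow line")
print(f"chained success rate: {rate:.2f} -> {status}")  # expected ~0.7**5 ~ 0.17
```

A per-step rate of 0.7 compounds to roughly 0.17 over five steps, near the hypothetical yellow line rather than the red one, which is why point capability and chained capability give very different risk pictures.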
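And the capability/propensity/control split as a data structure, to make the orthogonality concrete (field and class names are illustrative, not Atlas terminology):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RiskAssessment:
    """Three orthogonal questions about one dangerous capability X;
    each answer requires its own eval methodology."""
    can_do_x: Optional[bool]       # capability evals
    would_do_x: Optional[bool]     # propensity evals
    can_we_stop_x: Optional[bool]  # control evals

# A capability eval alone settles only the first field:
after_capability_evals = RiskAssessment(
    can_do_x=True, would_do_x=None, can_we_stop_x=None
)
```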
Open questions
- How do we eval capabilities a model is sandbagging on? Sandbagging defeats behavioral capability evals by construction. Proposed responses include capability elicitation protocols (fine-tuning, scaffolding, prompting harder), interpretability-based capability detection, and adversarial elicitation — all early-stage (Atlas Ch.5 — Limitations; see sandbagging-evals). A sketch of the elicitation-ladder idea follows this list.
- Where exactly are the dangerous thresholds? Most frontier-lab frameworks specify red-line capabilities qualitatively but lack quantitative thresholds for when a particular eval crosses the line. Improving this is an active concern (Atlas Ch.5 — Evaluation Frameworks).
- Are RSPs/Preparedness Frameworks tight enough? Independent analyses (ailabwatch.org; Existing Safety Frameworks Imply Unreasonable Confidence) argue current eval-based commitments don’t actually constrain dangerous deployment. Whether the eval methodology can carry the weight currently placed on it is contested.
- Do third-party evals work as a public-good mechanism? METR and the AISIs are an experiment in pharmaceutical-style independent safety testing for AI. Whether labs continue to grant access at sufficient depth, and whether the evaluators have sufficient independence and capacity, is empirically open (80,000 Hours: Holden Karnofsky on concrete safety).
- How does the pre/post-training separation hold under continual learning? Capability evals assume a discrete pre-deployment checkpoint to test. Continual-learning and online-RL pipelines blur this boundary, raising the question of when evals run and what counts as “this model” (Atlas Ch.5 — Limitations).
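As a sketch of the elicitation-ladder idea behind the sandbagging question above (the methods and toy scoring are hypothetical; a real protocol would also include fine-tuning and report per-method results):

```python
import random

def toy_model(task: str) -> float:
    """Stand-in for running one eval task end to end; returns a score in [0, 1]."""
    bonus = 0.2 if task.startswith("[tools+planning]") else 0.0
    return min(1.0, random.random() * 0.6 + bonus)

# Hypothetical elicitation ladder, weakest to strongest.
def plain_prompt(model, task):
    return model(task)

def best_of_n(model, task, n=16):
    return max(model(task) for _ in range(n))

def with_scaffolding(model, task):
    return model(f"[tools+planning] {task}")

def elicited_lower_bound(model, task, ladder) -> float:
    """Behavioral evals bound capability from below; report the max across
    elicitation methods, since a sandbagging model would have to underperform
    against every method at once to stay under threshold."""
    return max(method(model, task) for method in ladder)

random.seed(1)
print(elicited_lower_bound(
    toy_model, "copy-weights-to-new-server",
    [plain_prompt, best_of_n, with_scaffolding],
))
```

Taking the max is the design choice that matters: any single elicitation method can be fooled, but the reported lower bound tightens whenever at least one method draws out the latent capability.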
Related agendas
- capability-evals — the SR2025 agenda tracking ongoing capability-eval research output.
- autonomy-evals — eval class for long-horizon autonomous task completion.
- ai-deception-evals — eval class for deception capabilities.
- ai-scheming-evals — eval class for scheming behavior specifically.
- sandbagging-evals — methodologies for detecting deliberate underperformance on evals.
- self-replication-evals — eval class for autonomous replication.
- wmd-evals-weapons-of-mass-destruction — eval class for biosecurity/cyber/CBRN uplift.
- situational-awareness-and-self-awareness-evals — eval class for the prerequisite of scheming.
- various-redteams — adjacent: structured adversarial probing.
- control — control evals address the orthogonal “can we stop it?” question.
Related concepts
- responsible-scaling-policy — the policy framework capability evals serve as triggers for.
- frontier-safety-frameworks — the broader category Anthropic’s RSP, OpenAI’s PF, and DeepMind’s FSF share.
- ai-safety-levels — the threshold-based hierarchy capability evals classify models into.
- propensity-evaluations — orthogonal: measures will-it rather than can-it.
- control-evaluations — orthogonal: measures can-we-stop-it rather than can-it.
- evaluated-properties — the property taxonomy capability evals belong to.
- evaluation-design, evaluation-techniques, evaluation-frameworks, evaluation-limitations — adjacent eval-methodology concepts.
- deceptive-alignment, scheming — the fundamental limit on what behavioral capability evals can rule out.
- interpretability — proposed complement that could verify capability when behavior cannot.
- ai-control — operational layer that bounds consequences when capability evals can’t rule out misalignment.
- dangerous-capabilities — the property class capability evals target.
Related pages
- responsible-scaling-policy
- frontier-safety-frameworks
- ai-safety-levels
- propensity-evaluations
- control-evaluations
- evaluated-properties
- evaluation-design
- evaluation-techniques
- evaluation-frameworks
- evaluation-limitations
- deceptive-alignment
- scheming
- interpretability
- ai-control
- dangerous-capabilities
- anthropic
- openai
- deepmind
- metr
- ai-safety-institute
- capability-evals
- autonomy-evals
- ai-deception-evals
- ai-scheming-evals
- sandbagging-evals
- self-replication-evals
- wmd-evals-weapons-of-mass-destruction
- situational-awareness-and-self-awareness-evals
- various-redteams
- control
- nick-joseph
- holden-karnofsky
- ajeya-cotra
- 80k-podcast-nick-joseph-anthropic-safety
- 80k-podcast-holden-karnofsky-concrete-safety
- atlas-ch5-evaluations-00-introduction
- atlas-ch5-evaluations-03-evaluated-properties
- atlas-ch5-evaluations-04-dangerous-capability-evaluations
- atlas-ch5-evaluations-07-evaluation-frameworks
- atlas-ch5-evaluations-09-limitations
- ai-safety-atlas-textbook
Sources cited
Primary URLs harvested from this page’s summary references. Auto-generated by scripts/backfill_citations.py; edit by re-running, not by hand.
- AI Safety Atlas Ch.5 — Dangerous Capability Evaluations — referenced as [[atlas-ch5-evaluations-04-dangerous-capability-evaluations]]
- AI Safety Atlas Ch.5 — Evaluated Properties — referenced as [[atlas-ch5-evaluations-03-evaluated-properties]]
- AI Safety Atlas Ch.5 — Evaluation Frameworks — referenced as [[atlas-ch5-evaluations-07-evaluation-frameworks]]
- AI Safety Atlas Ch.5 — Introduction — referenced as [[atlas-ch5-evaluations-00-introduction]]
- AI Safety Atlas Ch.5 — Limitations — referenced as [[atlas-ch5-evaluations-09-limitations]]
- Summary: 80,000 Hours Podcast — Holden Karnofsky on Concrete AI Safety at Frontier Companies — referenced as [[80k-podcast-holden-karnofsky-concrete-safety]]
- Summary: 80,000 Hours Podcast — Nick Joseph on Anthropic’s Safety Approach — referenced as [[80k-podcast-nick-joseph-anthropic-safety]]