Capability Evaluations

Definition

Capability evaluations are systematic tests designed to assess whether an AI system has reached a specified threshold on a risk-relevant capability — such as autonomous replication, weapons-related uplift, deception, long-horizon planning, situational awareness, or the ability to evade oversight — before the model is trained further or deployed (Phuong et al. 2024, Evaluating Frontier Models for Dangerous Capabilities; Atlas Ch.5 — Dangerous Capability Evaluations).

They are distinct from standard performance benchmarks: a benchmark measures how good a model is at a task; a capability eval measures whether the model has crossed a risk threshold on a task whose mastery would enable catastrophic harm (Atlas Ch.5 — Evaluated Properties; see atlas-ch5-evaluations-03-evaluated-properties).

Why it matters

Capability evals are the trigger mechanism in every modern frontier-safety framework. The major frontier labs all use the same if-then logic: if the model reaches capability threshold X, then mitigation Y is required — and the trigger is the eval result (Anthropic’s Responsible Scaling Policy; OpenAI’s Preparedness Framework; DeepMind’s Frontier Safety Framework).
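
The sketch below illustrates that if-then gate in code. The capability names, threshold values, and mitigation labels are hypothetical placeholders for illustration, not numbers or commitments from any lab's actual framework.

    # Hypothetical thresholds and mitigations; no lab's actual numbers or commitments.
    THRESHOLDS = {
        "autonomous_replication": 0.20,   # illustrative eval-score threshold
        "cyber_uplift": 0.50,
    }

    MITIGATIONS = {
        "autonomous_replication": "pause further scaling until agreed controls are in place",
        "cyber_uplift": "restrict deployment and add misuse monitoring",
    }

    def required_mitigations(eval_scores: dict[str, float]) -> list[str]:
        """Return the mitigations triggered by the latest eval results (the if-then gate)."""
        triggered = []
        for capability, score in eval_scores.items():
            threshold = THRESHOLDS.get(capability)
            if threshold is not None and score >= threshold:
                triggered.append(MITIGATIONS[capability])
        return triggered

    # One capability crosses its threshold, so one mitigation is required.
    print(required_mitigations({"autonomous_replication": 0.25, "cyber_uplift": 0.10}))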

Without reliable evals, RSPs and similar commitments are unfalsifiable: there is no way to know whether a lab's models have crossed its own thresholds. With reliable evals, those commitments become auditable — and external evaluators (UK AISI, US AISI, METR) can independently check them (80,000 Hours: Nick Joseph on Anthropic’s Safety Approach; Atlas Ch.5 — Limitations).

Key results

  • The DeepMind dangerous-capabilities eval suite (Phuong et al. 2024) tests four risk-relevant capability families: persuasion-and-deception, cyber-security, self-proliferation, and self-reasoning. Frontier models (Gemini 1.0 Pro/Ultra) showed early indicators on persuasion/deception and self-reasoning but had not yet crossed any “dangerous” threshold. This is the canonical methodology paper for modern dangerous-capability evals.

  • The red-line / yellow-line distinction (Anthropic RSP; 80,000 Hours: Nick Joseph). Red-line capabilities are genuinely dangerous thresholds the lab commits not to cross without specific mitigations in place. Yellow-line evals are early-warning tests that flag when a model is approaching but has not yet crossed the red line — providing time to prepare (see the gating sketch after this list).

  • Frontier-lab convergence on the eval-based safety pattern. Anthropic’s RSP, OpenAI’s Preparedness Framework, and DeepMind’s Frontier Safety Framework all structure their commitments around capability-eval thresholds (Atlas Ch.5 — Evaluation Frameworks; see atlas-ch5-evaluations-07-evaluation-frameworks). Government safety bodies (UK AISI, US AISI, the EU AI Office) have adopted the same vocabulary and increasingly conduct independent evals.

  • Behavioral evals alone cannot rule out scheming or deceptive alignment (Atlas Ch.5 — Limitations; see atlas-ch5-evaluations-09-limitations). A strategically aware model can sandbag — deliberately underperform — on capability evals to avoid triggering mitigations. This is the central theoretical limitation: capability evals give a lower bound on what the model can do, not an upper bound.

  • Capability ≠ propensity ≠ control. Atlas Ch.5 separates three orthogonal questions: can the model do X (capability), will the model do X if given the chance (propensity), and can we stop the model from doing X (control). Each requires distinct eval methodologies; capability evals address only the first (Atlas Ch.5.3; see also propensity-evaluations and control-evaluations).

  • Sustained vs. isolated capability matters. A model that occasionally produces dangerous outputs is far less concerning than one that can reliably chain capabilities across multi-step plans. Evals must test consistency and chaining, not just point capability (80,000 Hours: Nick Joseph; Phuong et al. 2024); the chained-success calculation after this list illustrates why.
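
A minimal sketch of the red-line / yellow-line gating described in the second bullet above; the threshold values are hypothetical illustrations, not numbers from Anthropic's RSP or any other framework.

    # Hypothetical thresholds; the red/yellow values are illustrations only.
    RED_LINE = 0.80     # committed threshold: do not proceed without mitigations in place
    YELLOW_LINE = 0.50  # early warning: approaching, but not yet at, the red line

    def classify(eval_score: float) -> str:
        if eval_score >= RED_LINE:
            return "red: halt further scaling/deployment until mitigations are in place"
        if eval_score >= YELLOW_LINE:
            return "yellow: prepare mitigations and evaluate more frequently"
        return "green: below the early-warning threshold"

    print(classify(0.55))  # -> yellow: prepare mitigations and evaluate more frequently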
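
And a toy calculation behind the consistency-and-chaining point in the last bullet: if step failures were independent, a model with 90% per-step success would complete a 10-step plan only about 35% of the time, so point measurements of isolated steps can badly misstate end-to-end capability. The numbers below are illustrative only.

    def chained_success(per_step_success: float, n_steps: int) -> float:
        # End-to-end success of an n-step plan, assuming independent step failures.
        return per_step_success ** n_steps

    for p in (0.5, 0.9, 0.99):
        print(f"per-step {p:.2f} -> 10-step plan {chained_success(p, 10):.2f}")
    # per-step 0.50 -> 10-step plan 0.00
    # per-step 0.90 -> 10-step plan 0.35
    # per-step 0.99 -> 10-step plan 0.90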

Open questions

  • How do we evaluate capabilities a model is sandbagging on? Sandbagging defeats behavioral capability evals by construction. Proposed responses include capability elicitation protocols (fine-tuning, scaffolding, stronger prompting), interpretability-based capability detection, and adversarial elicitation — all early-stage (Atlas Ch.5 — Limitations; see sandbagging-evals). A sketch of the elicitation approach follows this list.

  • Where exactly are the dangerous thresholds? Most frontier-lab frameworks specify red-line capabilities qualitatively but lack quantitative thresholds for when a particular eval result crosses the line. Making these thresholds precise is an active area of work (Atlas Ch.5 — Evaluation Frameworks).

  • Are RSPs/Preparedness Frameworks tight enough? Independent analyses (ailabwatch.org; Existing Safety Frameworks Imply Unreasonable Confidence) argue current eval-based commitments don’t actually constrain dangerous deployment. Whether the eval methodology can carry the weight currently placed on it is contested.

  • Do third-party evals work as a public-good mechanism? METR and the AISIs are an experiment in pharmaceutical-style independent safety testing for AI. Whether labs continue to grant access at sufficient depth, and whether the evaluators have sufficient independence and capacity, are open empirical questions (80,000 Hours: Holden Karnofsky on concrete safety).

  • How does the train-then-test-then-deploy separation hold under continual learning? Capability evals assume a discrete pre-deployment checkpoint to test. Continual-learning and online-RL pipelines blur this boundary, raising the question of when evals should run and what counts as “this model” (Atlas Ch.5 — Limitations).
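
As a sketch of the elicitation response to sandbagging mentioned in the first open question: run the same eval under several elicitation conditions (stronger prompts, scaffolding, task-specific fine-tunes) and report the best score, treating it as a lower bound on the model's capability. The condition names and the run_eval interface below are hypothetical, not a real evaluation API.

    from typing import Callable

    # Hypothetical elicitation conditions; a real protocol is task- and model-specific.
    ELICITATION_CONDITIONS = [
        "zero_shot_prompt",
        "best_known_prompt",
        "tool_scaffold",
        "task_specific_finetune",
    ]

    def elicited_capability(run_eval: Callable[[str], float]) -> float:
        """Best score across elicitation conditions; still only a lower bound."""
        return max(run_eval(condition) for condition in ELICITATION_CONDITIONS)

    # Usage with a stand-in eval function (scores are made up):
    fake_scores = {"zero_shot_prompt": 0.20, "best_known_prompt": 0.40,
                   "tool_scaffold": 0.55, "task_specific_finetune": 0.60}
    print(elicited_capability(lambda c: fake_scores[c]))  # -> 0.6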
