Capability Evaluations

Definition

Capability evaluations are systematic tests designed to assess whether an AI system has reached a specified threshold on a risk-relevant capability — such as autonomous replication, weapons-related uplift, deception, long-horizon planning, situational awareness, or the ability to evade oversight — before the model is trained further or deployed (Phuong et al. 2024, Evaluating Frontier Models for Dangerous Capabilities; Atlas Ch.5 — Dangerous Capability Evaluations).

They are distinct from standard performance benchmarks: a benchmark measures how good a model is at a task; a capability eval measures whether the model has crossed a risk threshold on a task whose mastery would enable catastrophic harm (Atlas Ch.5 — Evaluated Properties; see atlas-ch5-evaluations-03-evaluated-properties).

Why it matters

Capability evals are the trigger mechanism in every modern frontier-safety framework. The major frontier labs all use the same if-then logic: if the model reaches capability threshold X, then mitigation Y is required — and the trigger is the eval result (Anthropic’s Responsible Scaling Policy; OpenAI’s Preparedness Framework; DeepMind’s Frontier Safety Framework).
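
The sketch below illustrates that if-then gate in code. The capability names, threshold values, and mitigation labels are hypothetical placeholders for illustration, not numbers or commitments from any lab's actual framework.

    # Hypothetical thresholds and mitigations; no lab's actual numbers or commitments.
    THRESHOLDS = {
        "autonomous_replication": 0.20,   # illustrative eval-score threshold
        "cyber_uplift": 0.50,
    }

    MITIGATIONS = {
        "autonomous_replication": "pause further scaling until agreed controls are in place",
        "cyber_uplift": "restrict deployment and add misuse monitoring",
    }

    def required_mitigations(eval_scores: dict[str, float]) -> list[str]:
        """Return the mitigations triggered by the latest eval results (the if-then gate)."""
        triggered = []
        for capability, score in eval_scores.items():
            threshold = THRESHOLDS.get(capability)
            if threshold is not None and score >= threshold:
                triggered.append(MITIGATIONS[capability])
        return triggered

    # One capability crosses its threshold, so one mitigation is required.
    print(required_mitigations({"autonomous_replication": 0.25, "cyber_uplift": 0.10}))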

Without reliable evals, RSPs and similar commitments are unfalsifiable: there is no way to know whether a lab's models have crossed its own thresholds. With reliable evals, those commitments become auditable — and external evaluators (UK AISI, US AISI, METR) can independently check them (80,000 Hours: Nick Joseph on Anthropic’s Safety Approach; Atlas Ch.5 — Limitations).

Key results

  • The DeepMind dangerous-capabilities eval suite (Phuong et al. 2024) tests four risk-relevant capability families: persuasion-and-deception, cyber-security, self-proliferation, and self-reasoning. Frontier models (Gemini 1.0 Pro/Ultra) showed early indicators on persuasion/deception and self-reasoning but had not yet crossed any “dangerous” threshold. This is the canonical methodology paper for modern dangerous-capability evals.

  • The red-line / yellow-line distinction (Anthropic RSP; 80,000 Hours: Nick Joseph). Red-line capabilities are genuinely dangerous thresholds the lab commits not to cross without specific mitigations in place. Yellow-line evals are early-warning tests that flag when a model is approaching but has not yet crossed the red line — providing time to prepare (see the gating sketch after this list).

  • Frontier-lab convergence on the eval-based safety pattern. Anthropic’s RSP, OpenAI’s Preparedness Framework, and DeepMind’s Frontier Safety Framework all structure their commitments around capability-eval thresholds (Atlas Ch.5 — Evaluation Frameworks; see atlas-ch5-evaluations-07-evaluation-frameworks). Government safety bodies (UK AISI, US AISI, the EU AI Office) have adopted the same vocabulary and increasingly conduct independent evals.

  • Behavioral evals alone cannot rule out scheming or deceptive alignment (Atlas Ch.5 — Limitations; see atlas-ch5-evaluations-09-limitations). A strategically aware model can sandbag — deliberately underperform — on capability evals to avoid triggering mitigations. This is the central theoretical limitation: capability evals give a lower bound on what the model can do, not an upper bound.

  • Capability ≠ propensity ≠ control. Atlas Ch.5 separates three orthogonal questions: can the model do X (capability), will the model do X if given the chance (propensity), and can we stop the model from doing X (control). Each requires distinct eval methodologies; capability evals address only the first (Atlas Ch.5.3; see also propensity-evaluations and control-evaluations).

  • Sustained vs. isolated capability matters. A model that occasionally produces dangerous outputs is far less concerning than one that can reliably chain capabilities across multi-step plans. Evals must test consistency and chaining, not just point capability (80,000 Hours: Nick Joseph; Phuong et al. 2024); the chained-success calculation after this list illustrates why.
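
A minimal sketch of the red-line / yellow-line gating described in the second bullet above; the threshold values are hypothetical illustrations, not numbers from Anthropic's RSP or any other framework.

    # Hypothetical thresholds; the red/yellow values are illustrations only.
    RED_LINE = 0.80     # committed threshold: do not proceed without mitigations in place
    YELLOW_LINE = 0.50  # early warning: approaching, but not yet at, the red line

    def classify(eval_score: float) -> str:
        if eval_score >= RED_LINE:
            return "red: halt further scaling/deployment until mitigations are in place"
        if eval_score >= YELLOW_LINE:
            return "yellow: prepare mitigations and evaluate more frequently"
        return "green: below the early-warning threshold"

    print(classify(0.55))  # -> yellow: prepare mitigations and evaluate more frequently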
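
And a toy calculation behind the consistency-and-chaining point in the last bullet: if step failures were independent, a model with 90% per-step success would complete a 10-step plan only about 35% of the time, so point measurements of isolated steps can badly misstate end-to-end capability. The numbers below are illustrative only.

    def chained_success(per_step_success: float, n_steps: int) -> float:
        # End-to-end success of an n-step plan, assuming independent step failures.
        return per_step_success ** n_steps

    for p in (0.5, 0.9, 0.99):
        print(f"per-step {p:.2f} -> 10-step plan {chained_success(p, 10):.2f}")
    # per-step 0.50 -> 10-step plan 0.00
    # per-step 0.90 -> 10-step plan 0.35
    # per-step 0.99 -> 10-step plan 0.90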

Open questions

  • How do we evaluate capabilities a model is sandbagging on? Sandbagging defeats behavioral capability evals by construction. Proposed responses include capability elicitation protocols (fine-tuning, scaffolding, stronger prompting), interpretability-based capability detection, and adversarial elicitation — all early-stage (Atlas Ch.5 — Limitations; see sandbagging-evals). A sketch of the elicitation approach follows this list.

  • Where exactly are the dangerous thresholds? Most frontier-lab frameworks specify red-line capabilities qualitatively but lack quantitative thresholds for when a particular eval result crosses the line. Making these thresholds precise is an active area of work (Atlas Ch.5 — Evaluation Frameworks).

  • Are RSPs/Preparedness Frameworks tight enough? Independent analyses (ailabwatch.org; Existing Safety Frameworks Imply Unreasonable Confidence) argue current eval-based commitments don’t actually constrain dangerous deployment. Whether the eval methodology can carry the weight currently placed on it is contested.

  • Do third-party evals work as a public-good mechanism? METR and the AISIs are an experiment in pharmaceutical-style independent safety testing for AI. Whether labs continue to grant access at sufficient depth, and whether the evaluators have sufficient independence and capacity, are open empirical questions (80,000 Hours: Holden Karnofsky on concrete safety).

  • How does the train-then-test-then-deploy separation hold under continual learning? Capability evals assume a discrete pre-deployment checkpoint to test. Continual-learning and online-RL pipelines blur this boundary, raising the question of when evals should run and what counts as “this model” (Atlas Ch.5 — Limitations).
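
As a sketch of the elicitation response to sandbagging mentioned in the first open question: run the same eval under several elicitation conditions (stronger prompts, scaffolding, task-specific fine-tunes) and report the best score, treating it as a lower bound on the model's capability. The condition names and the run_eval interface below are hypothetical, not a real evaluation API.

    from typing import Callable

    # Hypothetical elicitation conditions; a real protocol is task- and model-specific.
    ELICITATION_CONDITIONS = [
        "zero_shot_prompt",
        "best_known_prompt",
        "tool_scaffold",
        "task_specific_finetune",
    ]

    def elicited_capability(run_eval: Callable[[str], float]) -> float:
        """Best score across elicitation conditions; still only a lower bound."""
        return max(run_eval(condition) for condition in ELICITATION_CONDITIONS)

    # Usage with a stand-in eval function (scores are made up):
    fake_scores = {"zero_shot_prompt": 0.20, "best_known_prompt": 0.40,
                   "tool_scaffold": 0.55, "task_specific_finetune": 0.60}
    print(elicited_capability(lambda c: fake_scores[c]))  # -> 0.6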
