Control Evaluations

Control evaluations test whether safety protocols remain effective when AI systems actively attempt to circumvent them. They adopt an adversarial perspective, assuming sophisticated AI agents that deliberately try to breach safeguards rather than merely make mistakes.

The AI Safety Atlas (Ch.5) treats control as one of three evaluated properties alongside capability and propensity.

Two Independent Failure Points

Safety operates through complementary barriers:

  1. Alignment — AI systems don’t desire harmful outcomes
  2. Control — AI systems cannot successfully execute harmful actions even if they try

Because the two barriers fail independently, they provide redundancy: a catastrophe requires both to fail at once. Control evaluations test the second barrier.
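
A toy calculation shows why independence buys redundancy (the failure rates below are hypothetical placeholders, not figures from the Atlas):

```python
# Toy arithmetic for the redundancy claim. Rates are hypothetical.
p_alignment_fails = 0.10  # the model actually pursues a harmful outcome
p_control_fails = 0.10    # the protocol fails to stop a model that tries

# If the two failures are independent, the risks multiply:
p_catastrophe = p_alignment_fails * p_control_fails
print(f"P(catastrophe) = {p_catastrophe:.1%}")  # 1.0%, vs. 10% with one barrier
```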

This distinction is the operational core of the ai-control paradigm — see buck-shlegeris and redwood-research.

Red vs. Blue Teams

Control evaluations structure safety as a competitive game:

  • Red teams construct attack policies that attempt to cause unacceptable outcomes
  • Blue teams develop protocols that detect and block those attacks

This converts abstract questions into measurable ones: “Can red teams bypass our safety measures?”
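
A minimal sketch of such a game, with hypothetical attack rates, monitor noise, and thresholds (none of these interfaces come from a published framework):

```python
import random

def run_control_eval(attack_policy, safety_protocol, n_episodes=10_000, seed=0):
    """Toy control evaluation: score how often the red team's attack_policy
    slips a harmful action past the blue team's safety_protocol."""
    rng = random.Random(seed)
    breaches = 0
    for _ in range(n_episodes):
        action = attack_policy(rng)              # red team chooses an action
        if action["is_attack"] and safety_protocol(action, rng):
            breaches += 1                        # an attack got through
    return 1 - breaches / n_episodes             # empirical safety score

def attack_policy(rng):
    """Red team: attack on 5% of episodes; attacks look more suspicious."""
    attacking = rng.random() < 0.05
    suspicion = rng.uniform(0.6, 1.0) if attacking else rng.uniform(0.0, 0.5)
    return {"is_attack": attacking, "suspicion": suspicion}

def safety_protocol(action, rng):
    """Blue team: a noisy monitor blocks anything scoring above a threshold."""
    monitor_score = action["suspicion"] + rng.gauss(0, 0.1)
    return monitor_score < 0.55   # True = allowed through

print(f"safety ≈ {run_control_eval(attack_policy, safety_protocol):.1%}")
```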

Early Detection as Inflection Point

Effective control aims to detect scheming before catastrophic harm occurs; it does not require perfect prevention indefinitely. Once developers catch a single attack attempt, they can:

  • Generate diverse attack variants
  • Train monitoring systems on real examples
  • Analyze failed countermeasures
  • Neutralize coordination mechanisms

This shifts the problem from “prevent all attacks” to “detect any one attempt,” a far more achievable target.
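
A quick calculation illustrates why: even a modest per-attempt catch rate compounds across repeated attempts (the 20% rate is a hypothetical assumption):

```python
# If each attack attempt is independently caught with probability p, the
# chance of catching at least one of n attempts is 1 - (1 - p)**n.
p = 0.2  # hypothetical per-attempt catch rate
for n in (1, 5, 10, 20):
    print(f"{n:2d} attempts -> P(catch at least one) = {1 - (1 - p) ** n:.1%}")
# -> 20.0%, 67.2%, 89.3%, 98.8%
```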

Defensive Layer Categories

Blue teams should integrate (a minimal monitoring sketch follows the list):

  • Core monitoring systems using trusted/untrusted combinations
  • Resampling and deference strategies
  • Automated auditing hierarchies
  • Human review processes
  • Comprehensive software infrastructure
  • Adversarial evaluation and training
  • Hardware-enforced security enhancements
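
How the first few layers might compose, as a hedged sketch: an untrusted model produces output, a trusted model scores it, and the protocol audits or defers based on thresholds. All names and thresholds here are illustrative assumptions, not a published spec.

```python
def trusted_monitoring(task, untrusted_generate, trusted_generate,
                       score_suspicion, audit,
                       defer_threshold=0.6, audit_threshold=0.9):
    """Sketch of a trusted-monitoring protocol with deference.

    untrusted_generate(task) -> output from the capable, possibly scheming model
    trusted_generate(task)   -> output from a weaker but trusted model
    score_suspicion(task, output) -> suspicion in [0, 1], from the trusted model
    audit(task, output)      -> expensive human review; True means safe
    """
    candidate = untrusted_generate(task)
    suspicion = score_suspicion(task, candidate)

    if suspicion >= audit_threshold:
        # Spend the scarce audit budget on the most suspicious outputs.
        return candidate if audit(task, candidate) else trusted_generate(task)
    if suspicion >= defer_threshold:
        # Moderately suspicious: defer to the trusted model instead.
        return trusted_generate(task)
    return candidate  # low suspicion: keep the capable model's output
```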

Red Team Structural Advantages

Red teams possess structural advantages over actual scheming models:

  • Extended analysis time (weeks/months vs. single forward passes)
  • Complete information about safety measures
  • Ability to learn from failed attempts without shutdown

These advantages allow human red teams to upper-bound what scheming models could execute in deployment: if the red team succeeds, a scheming model might too; if the red team fails despite its advantages, the protocol is likely safe against that model.

Critical Design Principles

Toy-model analysis revealed several design principles (the sketch after this list simulates the first):

  • Audit timing — random schedules outperform predictable ones
  • Coordination difficulty — making inter-instance coordination harder dramatically improves safety
  • Risk-stratified auditing — focus resources on higher-risk scenarios
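
A toy simulation of the audit-timing principle, under the assumption that the attacker attacks exactly once and fully knows any fixed schedule (all parameters hypothetical):

```python
import random

def catch_rate(T=100, audit_budget=10, schedule="random",
               trials=10_000, seed=0):
    """Toy model: the attacker attacks on exactly one of T steps, and the
    blue team audits audit_budget steps. Under a predictable schedule the
    attacker knows which steps are audited and simply avoids them."""
    rng = random.Random(seed)
    caught = 0
    for _ in range(trials):
        if schedule == "predictable":
            audited = set(range(audit_budget))  # fixed, known to the attacker
            attack_step = rng.choice(
                [t for t in range(T) if t not in audited])
        else:
            audited = set(rng.sample(range(T), audit_budget))  # unknown
            attack_step = rng.randrange(T)
        caught += attack_step in audited
    return caught / trials

print(f"predictable audits: {catch_rate(schedule='predictable'):.1%}")  # 0.0%
print(f"random audits:      {catch_rate(schedule='random'):.1%}")       # ~10%
```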

Safety-Usefulness Tradeoff

Game-theoretic analysis can formalize the alignment tax: how much usefulness must be sacrificed to reach a given safety threshold. It identifies points where marginal safety improvements require disproportionate utility sacrifice, enabling informed deployment decisions.
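
A sketch of how one might trace such a safety-usefulness frontier, assuming hypothetical suspicion-score distributions for benign and attack outputs:

```python
import random

def frontier_point(threshold, rng, n=20_000):
    """One point on a safety-usefulness frontier for a monitor threshold.
    Suspicion scores come from hypothetical overlapping distributions:
    benign outputs ~ N(0.3, 0.15), attack outputs ~ N(0.7, 0.15)."""
    benign = [rng.gauss(0.3, 0.15) for _ in range(n)]
    attacks = [rng.gauss(0.7, 0.15) for _ in range(n)]
    safety = sum(s >= threshold for s in attacks) / n    # attacks blocked
    usefulness = sum(s < threshold for s in benign) / n  # benign work kept
    return safety, usefulness

rng = random.Random(0)
for thr in (0.4, 0.5, 0.6, 0.7):
    safety, usefulness = frontier_point(thr, rng)
    print(f"threshold={thr:.1f}: safety={safety:.1%}, usefulness={usefulness:.1%}")
# Tightening the threshold buys safety at an accelerating cost in usefulness.
```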

Fundamental Limitations

Capability Gaps

As the gap widens between the capabilities a system shows during evaluation and those it can exercise in deployment, safety guarantees weaken.

Superhuman Capabilities Break the Methodology

“When models develop capabilities humans cannot comprehend, devise opaque reasoning chains, or possess qualitatively superhuman security-critical abilities,” the methodology fails.

The Atlas’s stark conclusion: developers should “target systems that are transformatively useful while remaining within controllable bounds” rather than maximizing raw capabilities.

This is an explicit argument for capability ceilings; it connects to differential-development and the ai-control strategic stance.

The Safety Paradox

Training models to execute attacks during evaluation poses its own risks. This necessitates additional safeguards:

  • Sandboxed environments
  • Limited internet access during testing
  • Exfiltration guards

These measures balance thorough testing against the risk of creating dangerous attack knowledge; a minimal sandboxing sketch follows.
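
One way such safeguards are commonly approximated is container isolation. A minimal sketch using standard Docker flags (the image and entrypoint names are hypothetical):

```python
import subprocess

# Run the evaluation harness in a network-isolated, read-only container.
# The image and entrypoint are hypothetical; the Docker flags are standard
# isolation options.
subprocess.run(
    [
        "docker", "run", "--rm",
        "--network", "none",       # no internet access during testing
        "--read-only",             # immutable filesystem hinders exfiltration
        "--memory", "4g",          # cap resource usage
        "control-eval-harness:latest",   # hypothetical evaluation image
        "python", "run_eval.py",         # hypothetical entrypoint
    ],
    check=True,
)
```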
