AI Safety Atlas Ch.5 — Control Evaluations

Source: Control Evaluations

Control evaluations test whether safety protocols remain effective when AI systems actively attempt to circumvent them. They adopt an adversarial perspective, assuming sophisticated AI agents that deliberately try to breach safeguards rather than systems that merely make mistakes. See control-evaluations.

Core Concept: Two Independent Failure Points

Safety operates through two complementary barriers:

  1. Alignment — AI systems don’t desire harmful outcomes
  2. Control — AI systems cannot successfully execute harmful actions even if they try

These act as independent failure points: a catastrophic outcome requires both barriers to fail. Control evaluations test the second barrier.
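
A toy calculation makes the independence point concrete (the probabilities below are illustrative placeholders, not Atlas estimates):

```python
# Catastrophe requires BOTH barriers to fail.
p_misaligned = 0.10     # P(model schemes)                 -- assumed, illustrative
p_control_fails = 0.05  # P(control fails given scheming)  -- assumed, illustrative

# If the two failures are independent:
p_catastrophe = p_misaligned * p_control_fails
print(f"P(catastrophe) = {p_catastrophe:.4f}")  # 0.0050, vs 0.10 with alignment alone
```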

Key Mechanisms

Red vs. Blue Teams

Control evaluations structure safety as a competitive game:

  • Red teams construct attack policies that attempt to cause unacceptable outcomes
  • Blue teams develop protective protocols

This converts an abstract question into a measurable one: “Can red teams bypass our safety measures?”
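
A minimal harness shows the shape of the game. Everything here is a hypothetical sketch: `attack_policy`, `safety_protocol`, and the toy probabilities are stand-ins, not the Atlas’s setup.

```python
import random

def run_control_eval(attack_policy, safety_protocol, n_episodes=10_000, seed=0):
    """Red team supplies attack_policy; blue team supplies safety_protocol.
    Returns the fraction of episodes in which an attack reaches deployment."""
    rng = random.Random(seed)
    breaches = 0
    for _ in range(n_episodes):
        action = attack_policy(rng)              # red team move
        allowed = safety_protocol(action, rng)   # blue team move
        if allowed and action["is_attack"]:
            breaches += 1                        # an attack slipped through
    return breaches / n_episodes

# Toy policies: the red team attacks rarely to evade statistics;
# the blue team catches any given attack half the time.
attack = lambda rng: {"is_attack": rng.random() < 0.02}
protocol = lambda action, rng: not (action["is_attack"] and rng.random() < 0.5)

print(f"breach rate: {run_control_eval(attack, protocol):.2%}")
```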

Early Detection as Safety Inflection Point

Effective control aims to detect scheming before catastrophic harm occurs; it does not require perfect prevention indefinitely. Once developers catch a single attack attempt, they gain powerful tools:

  • Generate diverse attack variants
  • Train monitoring systems on real examples
  • Analyze failed countermeasures
  • Neutralize coordination mechanisms
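
A sketch of what that post-detection pipeline might look like; every name and argument here is a hypothetical placeholder for real tooling, not Atlas-specified machinery.

```python
def respond_to_caught_attack(attack_example, monitor, generate_variants,
                             training_set, countermeasure_log):
    """Hypothetical pipeline run once a single attack attempt is caught."""
    variants = generate_variants(attack_example, n=100)    # diverse attack variants
    training_set.extend((v, "malicious") for v in variants)
    monitor.retrain(training_set)                # monitor learns from real examples
    countermeasure_log.analyze(attack_example)   # why did the defenses miss it?
    # A full pipeline would also probe for inter-instance coordination signals
    # embedded in the attack and neutralize them.
    return monitor
```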

This early-detection strategy is the strategic core of Buck Shlegeris and Redwood Research’s AI control agenda.

Defensive Layers

Blue teams should integrate multiple defense categories:

  • Core monitoring systems combining trusted and untrusted models
  • Resampling and deference strategies
  • Automated auditing hierarchies
  • Human review processes
  • Comprehensive software infrastructure
  • Adversarial evaluation and training
  • Hardware-enforced security enhancements
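
The first few layers compose naturally. Below is a minimal sketch of a trusted-monitoring step in the spirit of Redwood’s AI-control work; the thresholds, the `AuditBudget` helper, and the callable signatures are all illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class AuditBudget:
    remaining: int
    def available(self) -> bool:
        return self.remaining > 0
    def spend(self) -> None:
        self.remaining -= 1

def trusted_monitoring(task, untrusted_model, trusted_model, suspicion_score,
                       human_audit, budget, audit_threshold=0.95, defer_threshold=0.80):
    """One step of a trusted-monitoring protocol. The untrusted model does the
    work; the trusted (weaker) model scores how suspicious the output looks."""
    output = untrusted_model(task)
    s = suspicion_score(task, output)        # trusted model's suspicion in [0, 1]
    if s >= audit_threshold and budget.available():
        budget.spend()
        return human_audit(task, output)     # most suspicious outputs: human review
    if s >= defer_threshold:
        # (a resampling variant would instead redraw from the untrusted model here)
        return trusted_model(task)           # defer to the weaker but trusted model
    return output                            # everything else passes through
```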

Red Team Advantages

Red teams possess structural advantages over actual scheming models:

  • Extended analysis time (weeks/months vs. single forward passes)
  • Complete information about safety measures
  • Ability to learn from failed attempts without shutdown

These advantages allow human red teams to upper-bound what scheming models could execute in deployment.

Critical Design Principles

Analysis of toy models revealed:

  • Audit timing matters: random auditing outperforms predictable schedules
  • Coordination difficulty improves safety: making it harder for AI instances to coordinate dramatically improves outcomes
  • Risk-stratified auditing: focus audit resources on higher-risk scenarios
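
A toy simulation illustrates the first principle; the attack rate, audit budget, and attacker model are all made-up assumptions, not Atlas parameters.

```python
import random

def run_deployment(audit_fn, attacker_knows_schedule, n_steps=100_000,
                   attack_rate=0.01, seed=0):
    """Toy model: an attack on an audited step is caught, which ends the game
    (the safety inflection point above); an unaudited attack is a breach."""
    rng = random.Random(seed)
    breaches = caught = 0
    for t in range(n_steps):
        audited = audit_fn(t, rng)
        if attacker_knows_schedule and audited:
            continue                   # a schemer simply waits out known audits
        if rng.random() < attack_rate:
            if audited:
                caught += 1
                break                  # one caught attempt flips the game
            breaches += 1
    return breaches, caught

# Same 10% audit budget, different timing:
predictable = lambda t, rng: t % 10 == 0
randomized  = lambda t, rng: rng.random() < 0.10

print("predictable:", run_deployment(predictable, attacker_knows_schedule=True))
print("randomized: ", run_deployment(randomized, attacker_knows_schedule=False))
```

Against the predictable schedule, a schemer breaches hundreds of times and is never caught; against random auditing with the same budget, the first caught attempt typically arrives after only a handful of breaches.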

Safety-Usefulness Tradeoff

Game-theoretic analysis can formalize the alignment tax: how much usefulness must be sacrificed to reach a given safety threshold. It identifies points where marginal safety improvements require disproportionate utility sacrifice, enabling informed deployment decisions.
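
One way to trace such a frontier numerically is to sweep a monitoring threshold; the Gaussian suspicion-score distributions below are stand-ins for real monitor outputs.

```python
import random

rng = random.Random(0)
# Assumed monitor scores: attacks look somewhat more suspicious than honest work.
benign  = [rng.gauss(0.3, 0.15) for _ in range(10_000)]
attacks = [rng.gauss(0.7, 0.15) for _ in range(10_000)]

for threshold in (0.4, 0.5, 0.6, 0.7, 0.8):
    safety     = sum(s >= threshold for s in attacks) / len(attacks)  # attacks caught
    usefulness = sum(s <  threshold for s in benign)  / len(benign)   # honest work kept
    print(f"threshold {threshold:.1f}: safety {safety:.2f}, usefulness {usefulness:.2f}")
```

The region where safety gains flatten while usefulness keeps falling is exactly the disproportionate-sacrifice zone the analysis identifies.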

Fundamental Limitations

Control evaluations face critical constraints:

Capability Gaps

As the gap widens between the capabilities probed during evaluation and those exercised in deployment, safety guarantees weaken.

Superhuman Capabilities Break the Methodology

“When models develop capabilities humans cannot comprehend, devise opaque reasoning chains, or possess qualitatively superhuman security-critical abilities,” the methodology fails.

The Atlas’s stark conclusion: developers should “target systems that are transformatively useful while remaining within controllable bounds” rather than maximizing raw capabilities — a strong argument against unbounded ASI development.

Safety Paradox

Training models to execute attacks during evaluation itself poses risks. This necessitates additional safeguards: sandboxed environments, limited internet access during testing, and exfiltration guards. The approach balances thorough testing against the danger of creating new attack knowledge.
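
A minimal sketch of the sandboxing safeguard using standard Docker isolation flags; the image, mount path, and resource caps are illustrative choices, not the Atlas’s specification.

```python
import subprocess

def run_in_sandbox(host_dir: str, script: str, timeout_s: int = 60):
    """Run untrusted evaluation code with no network and capped resources.
    Exfiltration guards and monitoring would sit on top of this in practice."""
    return subprocess.run(
        ["docker", "run", "--rm",
         "--network=none",               # no internet access during testing
         "--memory=1g", "--cpus=1",      # resource caps
         "-v", f"{host_dir}:/work:ro",   # evaluation code mounted read-only
         "python:3.11-slim", "python", f"/work/{script}"],
        capture_output=True, text=True, timeout=timeout_s,
    )
```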
