AI Safety Atlas Ch.5 — Control Evaluations
Source: Control Evaluations
Control evaluations test whether safety protocols remain effective when AI systems actively attempt to circumvent them. They adopt an adversarial perspective — assuming sophisticated AI agents deliberately trying to breach safeguards rather than making mistakes. See control-evaluations.
Core Concept: Two Independent Failure Points
Safety operates through two complementary barriers:
- Alignment — AI systems don’t desire harmful outcomes
- Control — AI systems cannot successfully execute harmful actions even if they try
These create independent failure points against catastrophic outcomes. Control evaluations test the second barrier.
Key Mechanisms
Red vs. Blue Teams
Control evaluations structure safety as a competitive game:
- Red teams construct attack policies attempting unacceptable outcomes
- Blue teams develop protective protocols
This converts abstract questions into measurable ones: “Can red teams bypass our safety measures?”
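The measurable question above can be sketched as a toy Monte Carlo estimate: run many simulated episodes of the red/blue game and count how often an attack slips past the protocol undetected. The rates and interface here are illustrative assumptions, not the Atlas's methodology; real evaluations derive them from actual red-team attack policies and blue-team protocols.

```python
import random

def run_episode(attack_rate, detect_rate, rng):
    """One simulated deployment episode.

    Returns 'safe' if no attack occurs, 'caught' if the blue-team
    protocol flags the attack, 'unsafe' if it slips through.
    (Illustrative rates only.)
    """
    if rng.random() < attack_rate:      # red team attempts an attack
        if rng.random() < detect_rate:  # blue-team monitor flags it
            return "caught"
        return "unsafe"
    return "safe"

def estimate_safety(attack_rate, detect_rate, episodes=10_000, seed=0):
    """Fraction of episodes that do NOT end in an undetected attack."""
    rng = random.Random(seed)
    outcomes = [run_episode(attack_rate, detect_rate, rng)
                for _ in range(episodes)]
    return 1 - outcomes.count("unsafe") / episodes
```

Under this toy model, "Can red teams bypass our safety measures?" becomes a number: the estimated probability that at least the monitoring layer holds.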
Early Detection as Safety Inflection Point
Effective control aims to detect scheming before catastrophic harm — not require perfect prevention indefinitely. Once developers catch a single attack attempt, they gain powerful tools:
- Generate diverse attack variants
- Train monitoring systems on real examples
- Analyze failed countermeasures
- Neutralize coordination mechanisms
This is the strategic core of the ai-control agenda developed by Buck Shlegeris and redwood-research.
Defensive Layers
Blue teams should integrate multiple defense categories:
- Core monitoring systems using trusted/untrusted combinations
- Resampling and deference strategies
- Automated auditing hierarchies
- Human review processes
- Comprehensive software infrastructure
- Adversarial evaluation and training
- Hardware-enforced security enhancements
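The first layer above, trusted/untrusted monitoring, can be sketched as follows: a trusted (weaker but non-scheming) model scores each untrusted-model output for suspicion, and a limited human-audit budget is spent on the most suspicious ones. The function and its interface are hypothetical illustrations; real protocols layer resampling, deference to a trusted model, and hierarchical review on top of this core idea.

```python
def trusted_monitoring(outputs, suspicion_score, audit_budget):
    """Route untrusted-model outputs using a trusted monitor's scores.

    outputs:         candidate actions from the untrusted model
    suspicion_score: trusted monitor mapping an output to a float in [0, 1]
    audit_budget:    fraction of outputs humans can afford to review

    Returns (audited, passed): indices sent to human audit vs. executed.
    Hypothetical interface for illustration only.
    """
    # Rank outputs by suspicion, most suspicious first
    ranked = sorted(range(len(outputs)),
                    key=lambda i: suspicion_score(outputs[i]),
                    reverse=True)
    # Spend the audit budget on the top of the ranking (at least one audit)
    k = max(1, int(audit_budget * len(outputs)))
    audited = set(ranked[:k])
    passed = [i for i in range(len(outputs)) if i not in audited]
    return sorted(audited), passed
```

The design choice worth noting: the trusted model need not be capable, only honest, because it is never asked to do the task itself, just to grade the untrusted model's work.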
Red Team Advantages
Red teams possess structural advantages over actual scheming models:
- Extended analysis time (weeks/months vs. single forward passes)
- Complete information about safety measures
- Ability to learn from failed attempts without shutdown
These advantages allow human red teams to upper-bound what scheming models could execute in deployment.
Critical Design Principles
Analysis of toy models revealed:
- Audit timing matters — random auditing > predictable schedules
- Coordination difficulty improves safety — making inter-instance coordination harder dramatically improves outcomes
- Risk-stratified auditing — focus resources on higher-risk scenarios
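The first principle above, random auditing beating predictable schedules, can be shown with a toy simulation: an attacker who knows a fixed audit schedule simply attacks an unaudited step, while against a random schedule it can do no better than guessing. All numbers here are illustrative assumptions, not results from the Atlas's toy models.

```python
import random

def simulate(audit_fraction, steps=100, trials=2_000, seed=0):
    """Compare catch rates for predictable vs. random audit schedules.

    Toy model: one attack per trial. Under a predictable schedule the
    attacker knows which steps are audited and attacks elsewhere; under
    a random schedule it must pick a step blind. (Illustrative only.)
    """
    rng = random.Random(seed)
    budget = int(audit_fraction * steps)
    fixed = set(range(budget))  # predictable: always audit the first steps
    caught_fixed = caught_random = 0
    for _ in range(trials):
        # Predictable schedule: attacker avoids the known audited steps
        attack_step = rng.randrange(budget, steps)
        caught_fixed += attack_step in fixed  # never caught
        # Random schedule: fresh audit set each trial, attacker guesses
        random_audits = set(rng.sample(range(steps), budget))
        caught_random += rng.randrange(steps) in random_audits
    return caught_fixed / trials, caught_random / trials
```

With a 20% audit budget, the predictable schedule catches nothing while the random schedule catches roughly 20% of attacks, which is enough to matter given that a single caught attempt is an inflection point.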
Safety-Usefulness Tradeoff
Game-theoretic analysis can formalize the alignment tax — how much usefulness must be sacrificed to reach a given safety threshold. It identifies points where marginal safety improvements require disproportionate utility sacrifice, enabling informed deployment decisions.
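One way to operationalize this tradeoff: given candidate protocols, each with an estimated (safety, usefulness) pair, pick the most useful protocol that still clears a safety threshold and report the usefulness given up relative to an unmonitored baseline. The protocol names and numbers below are hypothetical; real frontiers come from running the red/blue game under each candidate protocol.

```python
def alignment_tax(protocols, safety_threshold):
    """Smallest usefulness sacrifice that meets a safety threshold.

    protocols: list of (name, safety, usefulness) tuples, with
    usefulness normalized so the unmonitored baseline scores 1.0.
    Returns (protocol_name, tax) or None if no protocol qualifies.
    Hypothetical interface and numbers, for illustration.
    """
    feasible = [p for p in protocols if p[1] >= safety_threshold]
    if not feasible:
        return None  # no protocol meets the threshold
    best = max(feasible, key=lambda p: p[2])  # most useful feasible protocol
    return best[0], 1.0 - best[2]             # tax vs. the baseline

# Hypothetical example frontier:
example = [
    ("no monitoring",      0.05, 1.00),
    ("trusted monitoring", 0.92, 0.95),
    ("defer to trusted",   0.99, 0.70),
]
```

On the example numbers, demanding 0.9 safety costs a 5% usefulness tax, but pushing to 0.99 costs 30% — exactly the kind of disproportionate-sacrifice point the analysis is meant to surface.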
Fundamental Limitations
Control evaluations face critical constraints:
Capability Gaps
As differences widen between evaluation and deployment capabilities, safety guarantees weaken.
Superhuman Capabilities Break the Methodology
“When models develop capabilities humans cannot comprehend, devise opaque reasoning chains, or possess qualitatively superhuman security-critical abilities,” the methodology fails.
The Atlas’s stark conclusion: developers should “target systems that are transformatively useful while remaining within controllable bounds” rather than maximizing raw capabilities — a strong argument against unbounded ASI development.
Safety Paradox
Training models to execute attacks during evaluation itself poses risks, necessitating additional safeguards: sandboxed environments, limited internet access during testing, and exfiltration guards. The approach must balance thorough testing against the creation of dangerous attack knowledge.
Connection to Wiki
- control-evaluations — dedicated concept page
- ai-control — the broader paradigm
- buck-shlegeris — control-paradigm originator
- redwood-research — primary control-research org
- control — the SR2025 agenda
- evaluated-properties — the three-way decomposition
- 80k-podcast-buck-shlegeris-ai-control — extended treatment