AI Safety Atlas Ch.5 — Control Evaluations

Source: Control Evaluations

Control evaluations test whether safety protocols remain effective when AI systems actively attempt to circumvent them. They adopt an adversarial perspective, assuming sophisticated AI agents that deliberately try to breach safeguards rather than systems that merely make mistakes. See control-evaluations.

Core Concept: Two Independent Failure Points

Safety operates through two complementary barriers:

  1. Alignment — AI systems don’t desire harmful outcomes
  2. Control — AI systems cannot successfully execute harmful actions even if they try

These act as independent failure points: a catastrophic outcome requires both barriers to fail. Control evaluations test the second barrier.
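
A toy calculation makes the independence point concrete (the probabilities below are illustrative placeholders, not Atlas estimates):

```python
# Catastrophe requires BOTH barriers to fail.
p_misaligned = 0.10     # P(model schemes)                 -- assumed, illustrative
p_control_fails = 0.05  # P(control fails given scheming)  -- assumed, illustrative

# If the two failures are independent:
p_catastrophe = p_misaligned * p_control_fails
print(f"P(catastrophe) = {p_catastrophe:.4f}")  # 0.0050, vs 0.10 with alignment alone
```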

Key Mechanisms

Red vs. Blue Teams

Control evaluations structure safety as a competitive game:

  • Red teams construct attack policies that attempt to cause unacceptable outcomes
  • Blue teams develop protective protocols

This converts an abstract question into a measurable one: “Can red teams bypass our safety measures?”
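
A minimal harness shows the shape of the game. Everything here is a hypothetical sketch: `attack_policy`, `safety_protocol`, and the toy probabilities are stand-ins, not the Atlas’s setup.

```python
import random

def run_control_eval(attack_policy, safety_protocol, n_episodes=10_000, seed=0):
    """Red team supplies attack_policy; blue team supplies safety_protocol.
    Returns the fraction of episodes in which an attack reaches deployment."""
    rng = random.Random(seed)
    breaches = 0
    for _ in range(n_episodes):
        action = attack_policy(rng)              # red team move
        allowed = safety_protocol(action, rng)   # blue team move
        if allowed and action["is_attack"]:
            breaches += 1                        # an attack slipped through
    return breaches / n_episodes

# Toy policies: the red team attacks rarely to evade statistics;
# the blue team catches any given attack half the time.
attack = lambda rng: {"is_attack": rng.random() < 0.02}
protocol = lambda action, rng: not (action["is_attack"] and rng.random() < 0.5)

print(f"breach rate: {run_control_eval(attack, protocol):.2%}")
```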

Early Detection as Safety Inflection Point

Effective control aims to detect scheming before catastrophic harm occurs; it does not require perfect prevention indefinitely. Once developers catch a single attack attempt, they gain powerful tools:

  • Generate diverse attack variants
  • Train monitoring systems on real examples
  • Analyze failed countermeasures
  • Neutralize coordination mechanisms
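
A sketch of what that post-detection pipeline might look like; every name and argument here is a hypothetical placeholder for real tooling, not Atlas-specified machinery.

```python
def respond_to_caught_attack(attack_example, monitor, generate_variants,
                             training_set, countermeasure_log):
    """Hypothetical pipeline run once a single attack attempt is caught."""
    variants = generate_variants(attack_example, n=100)    # diverse attack variants
    training_set.extend((v, "malicious") for v in variants)
    monitor.retrain(training_set)                # monitor learns from real examples
    countermeasure_log.analyze(attack_example)   # why did the defenses miss it?
    # A full pipeline would also probe for inter-instance coordination signals
    # embedded in the attack and neutralize them.
    return monitor
```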

This early-detection strategy is the strategic core of Buck Shlegeris and Redwood Research’s AI control agenda.

Defensive Layers

Blue teams should integrate multiple defense categories:

  • Core monitoring systems combining trusted and untrusted models
  • Resampling and deference strategies
  • Automated auditing hierarchies
  • Human review processes
  • Comprehensive software infrastructure
  • Adversarial evaluation and training
  • Hardware-enforced security enhancements
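
The first few layers compose naturally. Below is a minimal sketch of a trusted-monitoring step in the spirit of Redwood’s AI-control work; the thresholds, the `AuditBudget` helper, and the callable signatures are all illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class AuditBudget:
    remaining: int
    def available(self) -> bool:
        return self.remaining > 0
    def spend(self) -> None:
        self.remaining -= 1

def trusted_monitoring(task, untrusted_model, trusted_model, suspicion_score,
                       human_audit, budget, audit_threshold=0.95, defer_threshold=0.80):
    """One step of a trusted-monitoring protocol. The untrusted model does the
    work; the trusted (weaker) model scores how suspicious the output looks."""
    output = untrusted_model(task)
    s = suspicion_score(task, output)        # trusted model's suspicion in [0, 1]
    if s >= audit_threshold and budget.available():
        budget.spend()
        return human_audit(task, output)     # most suspicious outputs: human review
    if s >= defer_threshold:
        # (a resampling variant would instead redraw from the untrusted model here)
        return trusted_model(task)           # defer to the weaker but trusted model
    return output                            # everything else passes through
```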

Red Team Advantages

Red teams possess structural advantages over actual scheming models:

  • Extended analysis time (weeks/months vs. single forward passes)
  • Complete information about safety measures
  • Ability to learn from failed attempts without shutdown

These advantages allow human red teams to upper-bound what scheming models could execute in deployment.

Critical Design Principles

Analysis of toy models revealed:

  • Audit timing matters: random auditing outperforms predictable schedules
  • Coordination difficulty improves safety: making it harder for AI instances to coordinate dramatically improves outcomes
  • Risk-stratified auditing: focus audit resources on higher-risk scenarios
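
A toy simulation illustrates the first principle; the attack rate, audit budget, and attacker model are all made-up assumptions, not Atlas parameters.

```python
import random

def run_deployment(audit_fn, attacker_knows_schedule, n_steps=100_000,
                   attack_rate=0.01, seed=0):
    """Toy model: an attack on an audited step is caught, which ends the game
    (the safety inflection point above); an unaudited attack is a breach."""
    rng = random.Random(seed)
    breaches = caught = 0
    for t in range(n_steps):
        audited = audit_fn(t, rng)
        if attacker_knows_schedule and audited:
            continue                   # a schemer simply waits out known audits
        if rng.random() < attack_rate:
            if audited:
                caught += 1
                break                  # one caught attempt flips the game
            breaches += 1
    return breaches, caught

# Same 10% audit budget, different timing:
predictable = lambda t, rng: t % 10 == 0
randomized  = lambda t, rng: rng.random() < 0.10

print("predictable:", run_deployment(predictable, attacker_knows_schedule=True))
print("randomized: ", run_deployment(randomized, attacker_knows_schedule=False))
```

Against the predictable schedule, a schemer breaches hundreds of times and is never caught; against random auditing with the same budget, the first caught attempt typically arrives after only a handful of breaches.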

Safety-Usefulness Tradeoff

Game-theoretic analysis can formalize the alignment tax: how much usefulness must be sacrificed to reach a given safety threshold. It identifies points where marginal safety improvements require disproportionate utility sacrifice, enabling informed deployment decisions.
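
One way to trace such a frontier numerically is to sweep a monitoring threshold; the Gaussian suspicion-score distributions below are stand-ins for real monitor outputs.

```python
import random

rng = random.Random(0)
# Assumed monitor scores: attacks look somewhat more suspicious than honest work.
benign  = [rng.gauss(0.3, 0.15) for _ in range(10_000)]
attacks = [rng.gauss(0.7, 0.15) for _ in range(10_000)]

for threshold in (0.4, 0.5, 0.6, 0.7, 0.8):
    safety     = sum(s >= threshold for s in attacks) / len(attacks)  # attacks caught
    usefulness = sum(s <  threshold for s in benign)  / len(benign)   # honest work kept
    print(f"threshold {threshold:.1f}: safety {safety:.2f}, usefulness {usefulness:.2f}")
```

The region where safety gains flatten while usefulness keeps falling is exactly the disproportionate-sacrifice zone the analysis identifies.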

Fundamental Limitations

Control evaluations face critical constraints:

Capability Gaps

As the gap widens between the capabilities probed during evaluation and those exercised in deployment, safety guarantees weaken.

Superhuman Capabilities Break the Methodology

“When models develop capabilities humans cannot comprehend, devise opaque reasoning chains, or possess qualitatively superhuman security-critical abilities,” the methodology fails.

The Atlas’s stark conclusion: developers should “target systems that are transformatively useful while remaining within controllable bounds” rather than maximizing raw capabilities — a strong argument against unbounded ASI development.

Safety Paradox

Training models to execute attacks during evaluation itself poses risks. This necessitates additional safeguards: sandboxed environments, limited internet access during testing, and exfiltration guards. The approach balances thorough testing against the danger of creating new attack knowledge.
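
A minimal sketch of the sandboxing safeguard using standard Docker isolation flags; the image, mount path, and resource caps are illustrative choices, not the Atlas’s specification.

```python
import subprocess

def run_in_sandbox(host_dir: str, script: str, timeout_s: int = 60):
    """Run untrusted evaluation code with no network and capped resources.
    Exfiltration guards and monitoring would sit on top of this in practice."""
    return subprocess.run(
        ["docker", "run", "--rm",
         "--network=none",               # no internet access during testing
         "--memory=1g", "--cpus=1",      # resource caps
         "-v", f"{host_dir}:/work:ro",   # evaluation code mounted read-only
         "python:3.11-slim", "python", f"/work/{script}"],
        capture_output=True, text=True, timeout=timeout_s,
    )
```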
