Control Evaluations

Control evaluations test whether safety protocols remain effective when AI systems actively attempt to circumvent them. They adopt an adversarial perspective, assuming sophisticated AI agents that deliberately try to breach safeguards rather than merely make mistakes.

The AI Safety Atlas (Ch.5) treats control as one of three evaluated properties alongside capability and propensity.

Two Independent Failure Points

Safety operates through complementary barriers:

  1. Alignment — AI systems don’t desire harmful outcomes
  2. Control — AI systems cannot successfully execute harmful actions even if they try

Because the two barriers fail independently, they provide redundancy: a catastrophe requires both to fail at once. Control evaluations test the second barrier.
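
A toy calculation shows why independence buys redundancy (the failure rates below are hypothetical placeholders, not figures from the Atlas):

```python
# Toy arithmetic for the redundancy claim. Rates are hypothetical.
p_alignment_fails = 0.10  # the model actually pursues a harmful outcome
p_control_fails = 0.10    # the protocol fails to stop a model that tries

# If the two failures are independent, the risks multiply:
p_catastrophe = p_alignment_fails * p_control_fails
print(f"P(catastrophe) = {p_catastrophe:.1%}")  # 1.0%, vs. 10% with one barrier
```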

This distinction is the operational core of the ai-control paradigm — see buck-shlegeris and redwood-research.

Red vs. Blue Teams

Control evaluations structure safety as a competitive game:

  • Red teams construct attack policies that attempt to cause unacceptable outcomes
  • Blue teams develop protocols that detect and block those attacks

This converts abstract questions into measurable ones: “Can red teams bypass our safety measures?”
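
A minimal sketch of such a game, with hypothetical attack rates, monitor noise, and thresholds (none of these interfaces come from a published framework):

```python
import random

def run_control_eval(attack_policy, safety_protocol, n_episodes=10_000, seed=0):
    """Toy control evaluation: score how often the red team's attack_policy
    slips a harmful action past the blue team's safety_protocol."""
    rng = random.Random(seed)
    breaches = 0
    for _ in range(n_episodes):
        action = attack_policy(rng)              # red team chooses an action
        if action["is_attack"] and safety_protocol(action, rng):
            breaches += 1                        # an attack got through
    return 1 - breaches / n_episodes             # empirical safety score

def attack_policy(rng):
    """Red team: attack on 5% of episodes; attacks look more suspicious."""
    attacking = rng.random() < 0.05
    suspicion = rng.uniform(0.6, 1.0) if attacking else rng.uniform(0.0, 0.5)
    return {"is_attack": attacking, "suspicion": suspicion}

def safety_protocol(action, rng):
    """Blue team: a noisy monitor blocks anything scoring above a threshold."""
    monitor_score = action["suspicion"] + rng.gauss(0, 0.1)
    return monitor_score < 0.55   # True = allowed through

print(f"safety ≈ {run_control_eval(attack_policy, safety_protocol):.1%}")
```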

Early Detection as Inflection Point

Effective control aims to detect scheming before catastrophic harm occurs; it does not require perfect prevention indefinitely. Once developers catch a single attack attempt, they can:

  • Generate diverse attack variants
  • Train monitoring systems on real examples
  • Analyze failed countermeasures
  • Neutralize coordination mechanisms

This shifts the problem from “prevent all attacks” to “detect any one attempt,” a far more achievable target.
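
A quick calculation illustrates why: even a modest per-attempt catch rate compounds across repeated attempts (the 20% rate is a hypothetical assumption):

```python
# If each attack attempt is independently caught with probability p, the
# chance of catching at least one of n attempts is 1 - (1 - p)**n.
p = 0.2  # hypothetical per-attempt catch rate
for n in (1, 5, 10, 20):
    print(f"{n:2d} attempts -> P(catch at least one) = {1 - (1 - p) ** n:.1%}")
# -> 20.0%, 67.2%, 89.3%, 98.8%
```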

Defensive Layer Categories

Blue teams should integrate (a minimal monitoring sketch follows the list):

  • Core monitoring systems using trusted/untrusted combinations
  • Resampling and deference strategies
  • Automated auditing hierarchies
  • Human review processes
  • Comprehensive software infrastructure
  • Adversarial evaluation and training
  • Hardware-enforced security enhancements
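
How the first few layers might compose, as a hedged sketch: an untrusted model produces output, a trusted model scores it, and the protocol audits or defers based on thresholds. All names and thresholds here are illustrative assumptions, not a published spec.

```python
def trusted_monitoring(task, untrusted_generate, trusted_generate,
                       score_suspicion, audit,
                       defer_threshold=0.6, audit_threshold=0.9):
    """Sketch of a trusted-monitoring protocol with deference.

    untrusted_generate(task) -> output from the capable, possibly scheming model
    trusted_generate(task)   -> output from a weaker but trusted model
    score_suspicion(task, output) -> suspicion in [0, 1], from the trusted model
    audit(task, output)      -> expensive human review; True means safe
    """
    candidate = untrusted_generate(task)
    suspicion = score_suspicion(task, candidate)

    if suspicion >= audit_threshold:
        # Spend the scarce audit budget on the most suspicious outputs.
        return candidate if audit(task, candidate) else trusted_generate(task)
    if suspicion >= defer_threshold:
        # Moderately suspicious: defer to the trusted model instead.
        return trusted_generate(task)
    return candidate  # low suspicion: keep the capable model's output
```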

Red Team Structural Advantages

Red teams possess structural advantages over actual scheming models:

  • Extended analysis time (weeks/months vs. single forward passes)
  • Complete information about safety measures
  • Ability to learn from failed attempts without shutdown

These advantages allow human red teams to upper-bound what scheming models could execute in deployment: if the red team succeeds, a scheming model might too; if the red team fails despite its advantages, the protocol is likely safe against that model.

Critical Design Principles

Toy-model analysis revealed several design principles (the sketch after this list simulates the first):

  • Audit timing — random schedules outperform predictable ones
  • Coordination difficulty — making inter-instance coordination harder dramatically improves safety
  • Risk-stratified auditing — focus resources on higher-risk scenarios
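
A toy simulation of the audit-timing principle, under the assumption that the attacker attacks exactly once and fully knows any fixed schedule (all parameters hypothetical):

```python
import random

def catch_rate(T=100, audit_budget=10, schedule="random",
               trials=10_000, seed=0):
    """Toy model: the attacker attacks on exactly one of T steps, and the
    blue team audits audit_budget steps. Under a predictable schedule the
    attacker knows which steps are audited and simply avoids them."""
    rng = random.Random(seed)
    caught = 0
    for _ in range(trials):
        if schedule == "predictable":
            audited = set(range(audit_budget))  # fixed, known to the attacker
            attack_step = rng.choice(
                [t for t in range(T) if t not in audited])
        else:
            audited = set(rng.sample(range(T), audit_budget))  # unknown
            attack_step = rng.randrange(T)
        caught += attack_step in audited
    return caught / trials

print(f"predictable audits: {catch_rate(schedule='predictable'):.1%}")  # 0.0%
print(f"random audits:      {catch_rate(schedule='random'):.1%}")       # ~10%
```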

Safety-Usefulness Tradeoff

Game-theoretic analysis can formalize the alignment tax: how much usefulness must be sacrificed to reach a given safety threshold. It identifies points where marginal safety improvements require disproportionate utility sacrifice, enabling informed deployment decisions.
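
A sketch of how one might trace such a safety-usefulness frontier, assuming hypothetical suspicion-score distributions for benign and attack outputs:

```python
import random

def frontier_point(threshold, rng, n=20_000):
    """One point on a safety-usefulness frontier for a monitor threshold.
    Suspicion scores come from hypothetical overlapping distributions:
    benign outputs ~ N(0.3, 0.15), attack outputs ~ N(0.7, 0.15)."""
    benign = [rng.gauss(0.3, 0.15) for _ in range(n)]
    attacks = [rng.gauss(0.7, 0.15) for _ in range(n)]
    safety = sum(s >= threshold for s in attacks) / n    # attacks blocked
    usefulness = sum(s < threshold for s in benign) / n  # benign work kept
    return safety, usefulness

rng = random.Random(0)
for thr in (0.4, 0.5, 0.6, 0.7):
    safety, usefulness = frontier_point(thr, rng)
    print(f"threshold={thr:.1f}: safety={safety:.1%}, usefulness={usefulness:.1%}")
# Tightening the threshold buys safety at an accelerating cost in usefulness.
```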

Fundamental Limitations

Capability Gaps

As the gap widens between the capabilities a system shows during evaluation and those it can exercise in deployment, safety guarantees weaken.

Superhuman Capabilities Break the Methodology

“When models develop capabilities humans cannot comprehend, devise opaque reasoning chains, or possess qualitatively superhuman security-critical abilities,” the methodology fails.

The Atlas’s stark conclusion: developers should “target systems that are transformatively useful while remaining within controllable bounds” rather than maximizing raw capabilities.

This is an explicit argument for capability ceilings; it connects to differential-development and the ai-control strategic stance.

The Safety Paradox

Training models to execute attacks during evaluation poses its own risks. This necessitates additional safeguards:

  • Sandboxed environments
  • Limited internet access during testing
  • Exfiltration guards

These measures balance thorough testing against the risk of creating dangerous attack knowledge; a minimal sandboxing sketch follows.
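
One way such safeguards are commonly approximated is container isolation. A minimal sketch using standard Docker flags (the image and entrypoint names are hypothetical):

```python
import subprocess

# Run the evaluation harness in a network-isolated, read-only container.
# The image and entrypoint are hypothetical; the Docker flags are standard
# isolation options.
subprocess.run(
    [
        "docker", "run", "--rm",
        "--network", "none",       # no internet access during testing
        "--read-only",             # immutable filesystem hinders exfiltration
        "--memory", "4g",          # cap resource usage
        "control-eval-harness:latest",   # hypothetical evaluation image
        "python", "run_eval.py",         # hypothetical entrypoint
    ],
    check=True,
)
```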
