AI Safety Atlas Ch.5 — Evaluation Frameworks

Source: Evaluation Frameworks

Structured frameworks turn evaluation results into real decisions. Techniques are isolated tools; frameworks integrate them systematically. See evaluation-frameworks.

Two Framework Categories

Technical Frameworks

  • Specify which techniques to employ, how to integrate them, what outcomes to measure
  • Examples: Model Organisms Framework, evaluation suites
  • Concerned with how to measure

Governance Frameworks

  • Specify evaluation timing, what results trigger, how to integrate into organizational decisions
  • Examples: Anthropic’s RSP, OpenAI’s Preparedness Framework, DeepMind’s FSF
  • Concerned with what actions follow

Together they form a complete safety pipeline: technical frameworks measure → governance frameworks act.

Model Organisms Framework

Model organisms = deliberately created and studied misaligned AI systems with specific dangerous properties. The biological analogy is exact: study disease in lab mice rather than wild populations.

Why intentionally create dangerous systems?

  1. Study concerning properties in controlled environments where researchers know what they’re investigating
  2. Generate concrete examples informing the broader AI community

Decomposition approach: rather than tackling monolithic problems like “deceptive alignment,” identify subcomponents (situational awareness, goal persistence) and construct simplified systems exhibiting them.

Examples:

  • Anthropic’s “sleeper agents” — code-writing models inserting vulnerabilities only when asked about 2024
  • Reward model sycophancy — models exploiting RM biases to test alignment auditing
  • Alignment faking — strategic compliance under monitoring, reverting unmonitored

Tradeoff: realism (useful insights) vs. control (safe study). Models constructed to exhibit specific behaviors may not represent how those behaviors naturally emerge.

This connects to the wiki’s existing model-organisms-of-misalignment concept page.

Three Governance Frameworks

RSP — Responsible Scaling Policy (Anthropic)

AI Safety Levels (ASL) — inspired by biosafety levels:

  • ASL-1 — no meaningful catastrophic risk
  • ASL-2 — early dangerous capability signs (e.g., bioweapon assembly info, but not better than search engines)
  • ASL-3 — substantially increases catastrophic misuse risk OR shows low-level autonomous capabilities
  • ASL-4 / ASL-5+ — not yet defined; will involve qualitative escalations

See ai-safety-levels and responsible-scaling-policy.

Required evaluations: capability evaluations (autonomous replication, CBRN, cyberattack) + safety evaluations (control measures effective). Must function collaboratively — passing one category insufficient.

Preparedness Framework (OpenAI)

Rather than fixed levels, organizes evaluations around risk categories (cybersecurity, persuasion, autonomous replication) with low-to-critical risk spectrums.

Pre-mitigation vs. post-mitigation:

  • Pre-mitigation = raw capability and harm potential
  • Post-mitigation = whether safeguards reduce risks acceptably

Deployment baselines:

  • Only models with post-mitigation “medium” or below can deploy externally
  • Post-mitigation “high” or below permits internal deployment
  • Pre-mitigation “high” or “critical” requires weight-exfiltration prevention

Unique aspect: explicit “unknown unknowns” handling — processes for actively searching unanticipated risks.

Frontier Safety Framework (Google DeepMind)

Critical Capability Levels (CCLs) — separate CCLs for biosecurity, cybersecurity, autonomous capabilities. Each maintains its own evaluation requirements + triggers different mitigation combinations.

Evaluation timing:

  • DeepMind FSF — every 6× effective compute increase + every 3 months fine-tuning
  • Anthropic RSP — every 4× effective compute (lower threshold, more frequent)

Why Frameworks Matter

Without integrated frameworks, evaluations are aspirational — they generate data without driving action. Governance frameworks gate scaling on evaluation results, making evaluations consequential.

The three frameworks combined form the FSF cluster — twelve major companies as of March 2025 have published similar policies.

Connection to Wiki