Evaluation Frameworks

Evaluation frameworks integrate individual evaluation techniques into structured decision-making systems. The AI Safety Atlas (Ch.5.7) distinguishes two categories — technical and governance — that together form a complete safety pipeline.

Two Framework Categories

Technical Frameworks

  • Specify which techniques to employ, how to integrate them, what outcomes to measure
  • Examples: Model Organisms Framework, evaluation suites
  • Concerned with how to measure

Governance Frameworks

  • Specify when evaluations run, what actions their results trigger, and how organizations decide
  • Examples: Anthropic’s RSP, OpenAI’s Preparedness Framework, DeepMind’s FSF
  • Concerned with what actions follow

Together: technical → measure; governance → act on measurements.
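
As a minimal sketch of this division of labor (all names here are hypothetical, not drawn from any framework's actual tooling): the technical layer runs evaluations and returns scores; the governance layer maps those scores to an organizational action.

    # Hypothetical sketch: technical layer measures, governance layer acts.
    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class EvalResult:
        name: str     # which evaluation ran
        score: float  # assumed scale: 0.0 (capability absent) to 1.0 (fully present)

    def run_technical_suite(model, evals: dict[str, Callable]) -> list[EvalResult]:
        # Technical framework: decides *how to measure*.
        return [EvalResult(name, fn(model)) for name, fn in evals.items()]

    def governance_decision(results: list[EvalResult], threshold: float = 0.5) -> str:
        # Governance framework: decides *what actions follow* the measurements.
        if any(r.score >= threshold for r in results):
            return "pause-scaling-and-escalate"
        return "continue-scaling"

    # Toy usage: one stub evaluation run against a stand-in model object.
    print(governance_decision(run_technical_suite(object(), {"autonomy-probe": lambda m: 0.2})))
    # -> continue-scaling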

The Model Organisms Framework

Model organisms = deliberately created and studied misaligned AI systems with specific dangerous properties. Biological-research analogy: study disease in lab mice rather than wild populations.

Why intentionally create dangerous systems?

  1. Study concerning properties in controlled environments
  2. Generate concrete examples for the broader community

Decomposition approach: tackle subcomponents (situational awareness, goal persistence) rather than monolithic problems like “deceptive alignment.”
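
A hedged illustration of that decomposition (the evaluation names come from this section; the registry mechanism and placeholder scoring are assumptions): each subcomponent gets its own narrow, independently runnable evaluation instead of one monolithic "deceptive alignment" test.

    # Hypothetical registry of subcomponent evaluations.
    from typing import Callable

    SUBCOMPONENT_EVALS: dict[str, Callable[[object], float]] = {}

    def subcomponent(name: str):
        # Register one narrow evaluation under its subcomponent name.
        def register(fn):
            SUBCOMPONENT_EVALS[name] = fn
            return fn
        return register

    @subcomponent("situational_awareness")
    def eval_situational_awareness(model) -> float:
        return 0.0  # placeholder: would probe whether the model detects it is being tested

    @subcomponent("goal_persistence")
    def eval_goal_persistence(model) -> float:
        return 0.0  # placeholder: would probe whether goals survive further training

    def run_all(model) -> dict[str, float]:
        return {name: fn(model) for name, fn in SUBCOMPONENT_EVALS.items()}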

Examples:

  • Anthropic’s “sleeper agents” — models trained to write secure code when the prompt states the year is 2023 but to insert vulnerabilities when it states 2024; some backdoors persisted through safety training (probed in the sketch after this list)
  • Reward-model sycophancy — models trained to exploit RM biases, used to test whether alignment-auditing techniques detect the exploitation
  • Alignment faking — strategic compliance while monitored, reverting to misaligned behavior when unmonitored
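
One way such a model organism might be probed (a sketch under assumptions: model is a callable returning a completion, looks_vulnerable is some vulnerability classifier; neither comes from the cited work): hold the coding task fixed, vary only the trigger, and measure the behavioral gap.

    # Hypothetical trigger-vs-control probe for a sleeper-agent-style organism.
    def behavioral_divergence(model, task: str, looks_vulnerable, n: int = 20) -> float:
        contexts = {"control": "Current year: 2023.", "trigger": "Current year: 2024."}
        rates = {}
        for label, prefix in contexts.items():
            outputs = [model(f"{prefix}\n{task}") for _ in range(n)]
            rates[label] = sum(looks_vulnerable(o) for o in outputs) / n
        # A large gap means the conditional behavior is present -- e.g., it
        # survived safety training.
        return rates["trigger"] - rates["control"]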

Tradeoff: realism (useful insights) vs. control (safe study). A model constructed to exhibit a failure may not reflect how that failure would emerge naturally.

See model-organisms-of-misalignment.

Three Governance Frameworks

RSP — Responsible Scaling Policy (Anthropic)

AI Safety Levels (ASL) inspired by biosafety levels:

  • ASL-1 — no meaningful catastrophic risk
  • ASL-2 — early dangerous capability signs
  • ASL-3 — substantially increases catastrophic misuse risk OR shows low-level autonomous capabilities
  • ASL-4 / ASL-5+ — not yet defined

Required: capability evaluations (has the model crossed into a higher ASL?) + safety evaluations (do the safeguards required at the current ASL hold?), working together to gate scaling.
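
An illustrative sketch of that gating logic (the RSP defines ASLs in prose, not code; the thresholds and function names below are invented for illustration): capability evaluations assign the level, and safety evaluations must confirm the safeguards that level requires before scaling continues.

    # Hypothetical ASL assignment and scaling gate. Thresholds are made up.
    def assign_asl(misuse_uplift: float, autonomy: float) -> int:
        if misuse_uplift >= 0.8 or autonomy >= 0.5:
            return 3  # substantial misuse risk OR low-level autonomous capability
        if misuse_uplift >= 0.3:
            return 2  # early signs of dangerous capabilities
        return 1      # no meaningful catastrophic risk

    def may_continue_scaling(asl: int, safeguards_verified: bool) -> bool:
        # Safety evaluations must verify the safeguards required at ASL-3+.
        return asl < 3 or safeguards_verified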

Preparedness Framework (OpenAI)

Organized by risk category rather than fixed levels; tracked categories include cybersecurity, persuasion, and autonomous replication. Each category defines a spectrum from low to critical risk.

Pre-mitigation vs. post-mitigation:

  • Pre = raw capability and harm potential
  • Post = whether safeguards reduce risks acceptably

Deployment baselines (encoded in the sketch after this list):

  • Post-mitigation “medium” or below → eligible for external deployment
  • Post-mitigation “high” or below → further development permitted
  • Pre-mitigation “high” or “critical” → measures to prevent weight exfiltration required
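
Rendered as decision logic (a sketch only: the level ordering follows the low-to-critical spectrum above, and the function names are mine, not OpenAI's):

    # Hypothetical encoding of the Preparedness Framework deployment baselines.
    LEVELS = ["low", "medium", "high", "critical"]

    def at_most(level: str, ceiling: str) -> bool:
        return LEVELS.index(level) <= LEVELS.index(ceiling)

    def deployment_decision(pre: str, post: str) -> dict[str, bool]:
        return {
            "external_deploy_ok": at_most(post, "medium"),
            "further_development_ok": at_most(post, "high"),
            "prevent_weight_exfiltration": pre in ("high", "critical"),
        }

    print(deployment_decision(pre="high", post="medium"))
    # -> all three True: deployable externally, but weights must be protected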

Unique: explicit “unknown unknowns” handling — actively searching for risks the defined categories do not anticipate.

Frontier Safety Framework (Google DeepMind)

Critical Capability Levels (CCLs) — defined separately for biosecurity, cybersecurity, and autonomous capabilities. Each CCL carries its own evaluation requirements and mitigation combinations.

Evaluation timing (compared in the sketch after this list):

  • DeepMind FSF — re-evaluate at every 6× increase in effective compute and after every 3 months of fine-tuning progress
  • Anthropic RSP — re-evaluate at every 4× increase in effective compute (a more frequent compute trigger)
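
The two triggers, expressed as a re-evaluation check (the multipliers come from this section; how “effective compute” is measured in practice is not specified here, so the inputs are assumed to be tracked elsewhere):

    # Hypothetical re-evaluation trigger for the two frameworks' timing rules.
    def evaluation_due(framework: str, compute_ratio: float, months_finetuning: float = 0.0) -> bool:
        if framework == "deepmind_fsf":
            # Re-evaluate at every 6x effective-compute increase or 3 months of fine-tuning.
            return compute_ratio >= 6 or months_finetuning >= 3
        if framework == "anthropic_rsp":
            return compute_ratio >= 4  # tighter compute trigger
        raise ValueError(f"unknown framework: {framework}")

    assert evaluation_due("anthropic_rsp", compute_ratio=4.2)
    assert not evaluation_due("deepmind_fsf", compute_ratio=5.0, months_finetuning=2)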

Why Frameworks Are Necessary

Without integrated frameworks, evaluations are aspirational — they generate data without driving action. Governance frameworks gate scaling on evaluation results, making evaluations consequential.

These three are the canonical instances of frontier safety frameworks (FSFs), a structure twelve major companies had adopted as of March 2025.
