AI Safety Atlas Ch.5 — Evaluation Frameworks
Source: Evaluation Frameworks
Structured frameworks turn evaluation results into real decisions. Techniques are isolated tools; frameworks integrate them systematically. See evaluation-frameworks.
Two Framework Categories
Technical Frameworks
- Specify which techniques to employ, how to integrate them, what outcomes to measure
- Examples: Model Organisms Framework, evaluation suites
- Concerned with how to measure
Governance Frameworks
- Specify evaluation timing, what results trigger, how to integrate into organizational decisions
- Examples: Anthropic’s RSP, OpenAI’s Preparedness Framework, DeepMind’s FSF
- Concerned with what actions follow
Together they form a complete safety pipeline: technical frameworks measure → governance frameworks act.
Model Organisms Framework
Model organisms = deliberately created and studied misaligned AI systems with specific dangerous properties. The biological analogy is exact: study disease in lab mice rather than wild populations.
Why intentionally create dangerous systems?
- Study concerning properties in controlled environments where researchers know what they’re investigating
- Generate concrete examples informing the broader AI community
Decomposition approach: rather than tackling monolithic problems like “deceptive alignment,” identify subcomponents (situational awareness, goal persistence) and construct simplified systems exhibiting them.
Examples:
- Anthropic’s “sleeper agents” — code-writing models inserting vulnerabilities only when asked about 2024
- Reward model sycophancy — models exploiting RM biases to test alignment auditing
- Alignment faking — strategic compliance under monitoring, reverting unmonitored
Tradeoff: realism (useful insights) vs. control (safe study). Models constructed to exhibit specific behaviors may not represent how those behaviors naturally emerge.
This connects to the wiki’s existing model-organisms-of-misalignment concept page.
Three Governance Frameworks
RSP — Responsible Scaling Policy (Anthropic)
AI Safety Levels (ASL) — inspired by biosafety levels:
- ASL-1 — no meaningful catastrophic risk
- ASL-2 — early dangerous capability signs (e.g., bioweapon assembly info, but not better than search engines)
- ASL-3 — substantially increases catastrophic misuse risk OR shows low-level autonomous capabilities
- ASL-4 / ASL-5+ — not yet defined; will involve qualitative escalations
See ai-safety-levels and responsible-scaling-policy.
Required evaluations: capability evaluations (autonomous replication, CBRN, cyberattack) + safety evaluations (control measures effective). Must function collaboratively — passing one category insufficient.
Preparedness Framework (OpenAI)
Rather than fixed levels, organizes evaluations around risk categories (cybersecurity, persuasion, autonomous replication) with low-to-critical risk spectrums.
Pre-mitigation vs. post-mitigation:
- Pre-mitigation = raw capability and harm potential
- Post-mitigation = whether safeguards reduce risks acceptably
Deployment baselines:
- Only models with post-mitigation “medium” or below can deploy externally
- Post-mitigation “high” or below permits internal deployment
- Pre-mitigation “high” or “critical” requires weight-exfiltration prevention
Unique aspect: explicit “unknown unknowns” handling — processes for actively searching unanticipated risks.
Frontier Safety Framework (Google DeepMind)
Critical Capability Levels (CCLs) — separate CCLs for biosecurity, cybersecurity, autonomous capabilities. Each maintains its own evaluation requirements + triggers different mitigation combinations.
Evaluation timing:
- DeepMind FSF — every 6× effective compute increase + every 3 months fine-tuning
- Anthropic RSP — every 4× effective compute (lower threshold, more frequent)
Why Frameworks Matter
Without integrated frameworks, evaluations are aspirational — they generate data without driving action. Governance frameworks gate scaling on evaluation results, making evaluations consequential.
The three frameworks combined form the FSF cluster — twelve major companies as of March 2025 have published similar policies.
Connection to Wiki
- evaluation-frameworks — dedicated concept page
- ai-safety-levels — ASL specifically
- responsible-scaling-policy — Anthropic RSP
- frontier-safety-frameworks — the multi-company cluster
- if-then-commitments — the operational pattern
- model-organisms-of-misalignment — technical framework instance
- capability-evaluations — what frameworks gate on