Evaluation Frameworks
Evaluation frameworks integrate individual evaluation techniques into structured decision-making systems. The AI Safety Atlas (Ch.5.7) distinguishes two categories — technical and governance — that together form a complete safety pipeline.
Two Framework Categories
Technical Frameworks
- Specify which techniques to employ, how to integrate them, what outcomes to measure
- Examples: Model Organisms Framework, evaluation suites
- Concerned with how to measure
Governance Frameworks
- Specify evaluation timing, what results trigger, organizational decision-making
- Examples: Anthropic’s RSP, OpenAI’s Preparedness Framework, DeepMind’s FSF
- Concerned with what actions follow
Together: technical → measure; governance → act on measurements.
The Model Organisms Framework
Model organisms = deliberately created and studied misaligned AI systems with specific dangerous properties. Biological-research analogy: study disease in lab mice rather than wild populations.
Why intentionally create dangerous systems?
- Study concerning properties in controlled environments
- Generate concrete examples for the broader community
Decomposition approach: tackle subcomponents (situational awareness, goal persistence) rather than monolithic problems like “deceptive alignment.”
Examples:
- Anthropic’s “sleeper agents” — write secure code about 2023; insert vulnerabilities about 2024. Some persisted through safety training.
- Reward model sycophancy — exploit RM biases to test alignment-auditing detection
- Alignment faking — strategic compliance under monitoring, reverting unmonitored
Tradeoff: realism (useful insights) vs. control (safe study). Constructed-to-exhibit models may not represent natural emergence.
See model-organisms-of-misalignment.
Three Governance Frameworks
RSP — Responsible Scaling Policy (Anthropic)
AI Safety Levels (ASL) inspired by biosafety levels:
- ASL-1 — no meaningful catastrophic risk
- ASL-2 — early dangerous capability signs
- ASL-3 — substantially increases catastrophic misuse risk OR shows low-level autonomous capabilities
- ASL-4 / ASL-5+ — not yet defined
Required: capability evaluations + safety evaluations + collaborative function.
Preparedness Framework (OpenAI)
Risk-category-organized rather than fixed levels. Categories: cybersecurity, persuasion, autonomous replication. Each defines spectrums (low → critical risk).
Pre-mitigation vs. post-mitigation:
- Pre = raw capability and harm potential
- Post = whether safeguards reduce risks acceptably
Deployment baselines:
- Post-mitigation “medium” or below → external deploy
- Post-mitigation “high” or below → internal deploy
- Pre-mitigation “high”/“critical” → weight-exfiltration prevention required
Unique: explicit “unknown unknowns” handling — actively searching unanticipated risks.
Frontier Safety Framework (Google DeepMind)
Critical Capability Levels (CCLs) — separate for biosecurity, cybersecurity, autonomous capabilities. Each maintains its own evaluation requirements + mitigation combinations.
Evaluation timing:
- DeepMind FSF — every 6× effective compute increase + 3 months fine-tuning
- Anthropic RSP — every 4× effective compute (more frequent)
Why Frameworks Are Necessary
Without integrated frameworks, evaluations are aspirational — they generate data without driving action. Governance frameworks gate scaling on evaluation results, making evaluations consequential.
The three governance frameworks combined are the canonical instances of FSFs — twelve major companies as of March 2025.
Connection to Wiki
- ai-safety-levels — ASL specifically
- responsible-scaling-policy — Anthropic’s RSP
- frontier-safety-frameworks — multi-company cluster
- if-then-commitments — operational pattern
- model-organisms-of-misalignment — technical framework instance
- capability-evaluations — what frameworks gate on
- evaluation-design — design principles within frameworks
Related Pages
- ai-safety-levels
- responsible-scaling-policy
- frontier-safety-frameworks
- if-then-commitments
- model-organisms-of-misalignment
- capability-evaluations
- evaluation-design
- evaluated-properties
- ai-safety-atlas-textbook
- atlas-ch5-evaluations-07-evaluation-frameworks
Sources cited
Primary URLs harvested from this page’s summary references. Auto-generated by scripts/backfill_citations.py; edit by re-running, not by hand.
- AI Safety Atlas Ch.5 — Evaluation Frameworks — referenced as
[[atlas-ch5-evaluations-07-evaluation-frameworks]]