AI Safety Atlas Ch.5 — Evaluation Frameworks

Source: Evaluation Frameworks

Structured frameworks turn evaluation results into real decisions. Techniques are isolated tools; frameworks integrate them systematically. See evaluation-frameworks.

Two Framework Categories

Technical Frameworks

Specify which techniques to employ, how to integrate them, what outcomes to measure
Examples: Model Organisms Framework, evaluation suites
Concerned with how to measure

Governance Frameworks

Specify evaluation timing, what results trigger, how to integrate into organizational decisions
Examples: Anthropic’s RSP, OpenAI’s Preparedness Framework, DeepMind’s FSF
Concerned with what actions follow

Together they form a complete safety pipeline: technical frameworks measure → governance frameworks act.

Model Organisms Framework

Model organisms = deliberately created and studied misaligned AI systems with specific dangerous properties. The biological analogy is exact: study disease in lab mice rather than wild populations.

Why intentionally create dangerous systems?

Study concerning properties in controlled environments where researchers know what they’re investigating
Generate concrete examples informing the broader AI community

Decomposition approach: rather than tackling monolithic problems like “deceptive alignment,” identify subcomponents (situational awareness, goal persistence) and construct simplified systems exhibiting them.

Examples:

Anthropic’s “sleeper agents” — code-writing models inserting vulnerabilities only when asked about 2024
Reward model sycophancy — models exploiting RM biases to test alignment auditing
Alignment faking — strategic compliance under monitoring, reverting unmonitored

Tradeoff: realism (useful insights) vs. control (safe study). Models constructed to exhibit specific behaviors may not represent how those behaviors naturally emerge.

This connects to the wiki’s existing model-organisms-of-misalignment concept page.

Three Governance Frameworks

RSP — Responsible Scaling Policy (Anthropic)

AI Safety Levels (ASL) — inspired by biosafety levels:

ASL-1 — no meaningful catastrophic risk
ASL-2 — early dangerous capability signs (e.g., bioweapon assembly info, but not better than search engines)
ASL-3 — substantially increases catastrophic misuse risk OR shows low-level autonomous capabilities
ASL-4 / ASL-5+ — not yet defined; will involve qualitative escalations

See ai-safety-levels and responsible-scaling-policy.

Required evaluations: capability evaluations (autonomous replication, CBRN, cyberattack) + safety evaluations (control measures effective). Must function collaboratively — passing one category insufficient.

Preparedness Framework (OpenAI)

Rather than fixed levels, organizes evaluations around risk categories (cybersecurity, persuasion, autonomous replication) with low-to-critical risk spectrums.

Pre-mitigation vs. post-mitigation:

Pre-mitigation = raw capability and harm potential
Post-mitigation = whether safeguards reduce risks acceptably

Deployment baselines:

Only models with post-mitigation “medium” or below can deploy externally
Post-mitigation “high” or below permits internal deployment
Pre-mitigation “high” or “critical” requires weight-exfiltration prevention

Unique aspect: explicit “unknown unknowns” handling — processes for actively searching unanticipated risks.

Frontier Safety Framework (Google DeepMind)

Critical Capability Levels (CCLs) — separate CCLs for biosecurity, cybersecurity, autonomous capabilities. Each maintains its own evaluation requirements + triggers different mitigation combinations.

Evaluation timing:

DeepMind FSF — every 6× effective compute increase + every 3 months fine-tuning
Anthropic RSP — every 4× effective compute (lower threshold, more frequent)

Why Frameworks Matter

Without integrated frameworks, evaluations are aspirational — they generate data without driving action. Governance frameworks gate scaling on evaluation results, making evaluations consequential.

The three frameworks combined form the FSF cluster — twelve major companies as of March 2025 have published similar policies.

Connection to Wiki

evaluation-frameworks — dedicated concept page
ai-safety-levels — ASL specifically
responsible-scaling-policy — Anthropic RSP
frontier-safety-frameworks — the multi-company cluster
if-then-commitments — the operational pattern
model-organisms-of-misalignment — technical framework instance
capability-evaluations — what frameworks gate on

AI Safety Compendium

Explorer

AI Safety Atlas Ch.5 — Evaluation Frameworks

AI Safety Atlas Ch.5 — Evaluation Frameworks

Two Framework Categories

Technical Frameworks

Governance Frameworks

Model Organisms Framework

Three Governance Frameworks

RSP — Responsible Scaling Policy (Anthropic)

Preparedness Framework (OpenAI)

Frontier Safety Framework (Google DeepMind)

Why Frameworks Matter

Connection to Wiki

Graph View

Graph view

Table of Contents

Backlinks

AI Safety Compendium

Explorer

AI Safety Atlas Ch.5 — Evaluation Frameworks

AI Safety Atlas Ch.5 — Evaluation Frameworks

Two Framework Categories

Technical Frameworks

Governance Frameworks

Model Organisms Framework

Three Governance Frameworks

RSP — Responsible Scaling Policy (Anthropic)

Preparedness Framework (OpenAI)

Frontier Safety Framework (Google DeepMind)

Why Frameworks Matter

Connection to Wiki

Related Pages

Graph View

Graph view

Table of Contents

Backlinks