Evaluation Frameworks

Evaluation frameworks integrate individual evaluation techniques into structured decision-making systems. The AI Safety Atlas (Ch.5.7) distinguishes two categories — technical and governance — that together form a complete safety pipeline.

Two Framework Categories

Technical Frameworks

  • Specify which techniques to employ, how to integrate them, what outcomes to measure
  • Examples: Model Organisms Framework, evaluation suites
  • Concerned with how to measure

Governance Frameworks

  • Specify when evaluations run, what actions their results trigger, and how organizations decide
  • Examples: Anthropic’s RSP, OpenAI’s Preparedness Framework, DeepMind’s FSF
  • Concerned with what actions follow

Together: technical → measure; governance → act on measurements.
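
As a minimal sketch of this division of labor (all names here are hypothetical, not drawn from any framework's actual tooling): the technical layer runs evaluations and returns scores; the governance layer maps those scores to an organizational action.

    # Hypothetical sketch: technical layer measures, governance layer acts.
    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class EvalResult:
        name: str     # which evaluation ran
        score: float  # assumed scale: 0.0 (capability absent) to 1.0 (fully present)

    def run_technical_suite(model, evals: dict[str, Callable]) -> list[EvalResult]:
        # Technical framework: decides *how to measure*.
        return [EvalResult(name, fn(model)) for name, fn in evals.items()]

    def governance_decision(results: list[EvalResult], threshold: float = 0.5) -> str:
        # Governance framework: decides *what actions follow* the measurements.
        if any(r.score >= threshold for r in results):
            return "pause-scaling-and-escalate"
        return "continue-scaling"

    # Toy usage: one stub evaluation run against a stand-in model object.
    print(governance_decision(run_technical_suite(object(), {"autonomy-probe": lambda m: 0.2})))
    # -> continue-scaling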

The Model Organisms Framework

Model organisms = deliberately created and studied misaligned AI systems with specific dangerous properties. Biological-research analogy: study disease in lab mice rather than wild populations.

Why intentionally create dangerous systems?

  1. Study concerning properties in controlled environments
  2. Generate concrete examples for the broader community

Decomposition approach: tackle subcomponents (situational awareness, goal persistence) rather than monolithic problems like “deceptive alignment.”
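
A hedged illustration of that decomposition (the evaluation names come from this section; the registry mechanism and placeholder scoring are assumptions): each subcomponent gets its own narrow, independently runnable evaluation instead of one monolithic "deceptive alignment" test.

    # Hypothetical registry of subcomponent evaluations.
    from typing import Callable

    SUBCOMPONENT_EVALS: dict[str, Callable[[object], float]] = {}

    def subcomponent(name: str):
        # Register one narrow evaluation under its subcomponent name.
        def register(fn):
            SUBCOMPONENT_EVALS[name] = fn
            return fn
        return register

    @subcomponent("situational_awareness")
    def eval_situational_awareness(model) -> float:
        return 0.0  # placeholder: would probe whether the model detects it is being tested

    @subcomponent("goal_persistence")
    def eval_goal_persistence(model) -> float:
        return 0.0  # placeholder: would probe whether goals survive further training

    def run_all(model) -> dict[str, float]:
        return {name: fn(model) for name, fn in SUBCOMPONENT_EVALS.items()}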

Examples:

  • Anthropic’s “sleeper agents” — models trained to write secure code when the prompt states the year is 2023 but to insert vulnerabilities when it states 2024; some backdoors persisted through safety training (probed in the sketch after this list)
  • Reward-model sycophancy — models trained to exploit RM biases, used to test whether alignment-auditing techniques detect the exploitation
  • Alignment faking — strategic compliance while monitored, reverting to misaligned behavior when unmonitored
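
One way such a model organism might be probed (a sketch under assumptions: model is a callable returning a completion, looks_vulnerable is some vulnerability classifier; neither comes from the cited work): hold the coding task fixed, vary only the trigger, and measure the behavioral gap.

    # Hypothetical trigger-vs-control probe for a sleeper-agent-style organism.
    def behavioral_divergence(model, task: str, looks_vulnerable, n: int = 20) -> float:
        contexts = {"control": "Current year: 2023.", "trigger": "Current year: 2024."}
        rates = {}
        for label, prefix in contexts.items():
            outputs = [model(f"{prefix}\n{task}") for _ in range(n)]
            rates[label] = sum(looks_vulnerable(o) for o in outputs) / n
        # A large gap means the conditional behavior is present -- e.g., it
        # survived safety training.
        return rates["trigger"] - rates["control"]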

Tradeoff: realism (useful insights) vs. control (safe study). A model constructed to exhibit a failure may not reflect how that failure would emerge naturally.

See model-organisms-of-misalignment.

Three Governance Frameworks

RSP — Responsible Scaling Policy (Anthropic)

AI Safety Levels (ASL) inspired by biosafety levels:

  • ASL-1 — no meaningful catastrophic risk
  • ASL-2 — early dangerous capability signs
  • ASL-3 — substantially increases catastrophic misuse risk OR shows low-level autonomous capabilities
  • ASL-4 / ASL-5+ — not yet defined

Required: capability evaluations (has the model crossed into a higher ASL?) + safety evaluations (do the safeguards required at the current ASL hold?), working together to gate scaling.
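
An illustrative sketch of that gating logic (the RSP defines ASLs in prose, not code; the thresholds and function names below are invented for illustration): capability evaluations assign the level, and safety evaluations must confirm the safeguards that level requires before scaling continues.

    # Hypothetical ASL assignment and scaling gate. Thresholds are made up.
    def assign_asl(misuse_uplift: float, autonomy: float) -> int:
        if misuse_uplift >= 0.8 or autonomy >= 0.5:
            return 3  # substantial misuse risk OR low-level autonomous capability
        if misuse_uplift >= 0.3:
            return 2  # early signs of dangerous capabilities
        return 1      # no meaningful catastrophic risk

    def may_continue_scaling(asl: int, safeguards_verified: bool) -> bool:
        # Safety evaluations must verify the safeguards required at ASL-3+.
        return asl < 3 or safeguards_verified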

Preparedness Framework (OpenAI)

Organized by risk category rather than fixed levels; tracked categories include cybersecurity, persuasion, and autonomous replication. Each category defines a spectrum from low to critical risk.

Pre-mitigation vs. post-mitigation:

  • Pre = raw capability and harm potential
  • Post = whether safeguards reduce risks acceptably

Deployment baselines (encoded in the sketch after this list):

  • Post-mitigation “medium” or below → eligible for external deployment
  • Post-mitigation “high” or below → further development permitted
  • Pre-mitigation “high” or “critical” → measures to prevent weight exfiltration required
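
Rendered as decision logic (a sketch only: the level ordering follows the low-to-critical spectrum above, and the function names are mine, not OpenAI's):

    # Hypothetical encoding of the Preparedness Framework deployment baselines.
    LEVELS = ["low", "medium", "high", "critical"]

    def at_most(level: str, ceiling: str) -> bool:
        return LEVELS.index(level) <= LEVELS.index(ceiling)

    def deployment_decision(pre: str, post: str) -> dict[str, bool]:
        return {
            "external_deploy_ok": at_most(post, "medium"),
            "further_development_ok": at_most(post, "high"),
            "prevent_weight_exfiltration": pre in ("high", "critical"),
        }

    print(deployment_decision(pre="high", post="medium"))
    # -> all three True: deployable externally, but weights must be protected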

Unique: explicit “unknown unknowns” handling — actively searching for risks the defined categories do not anticipate.

Frontier Safety Framework (Google DeepMind)

Critical Capability Levels (CCLs) — defined separately for biosecurity, cybersecurity, and autonomous capabilities. Each CCL carries its own evaluation requirements and mitigation combinations.

Evaluation timing (compared in the sketch after this list):

  • DeepMind FSF — re-evaluate at every 6× increase in effective compute and after every 3 months of fine-tuning progress
  • Anthropic RSP — re-evaluate at every 4× increase in effective compute (a more frequent compute trigger)
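
The two triggers, expressed as a re-evaluation check (the multipliers come from this section; how “effective compute” is measured in practice is not specified here, so the inputs are assumed to be tracked elsewhere):

    # Hypothetical re-evaluation trigger for the two frameworks' timing rules.
    def evaluation_due(framework: str, compute_ratio: float, months_finetuning: float = 0.0) -> bool:
        if framework == "deepmind_fsf":
            # Re-evaluate at every 6x effective-compute increase or 3 months of fine-tuning.
            return compute_ratio >= 6 or months_finetuning >= 3
        if framework == "anthropic_rsp":
            return compute_ratio >= 4  # tighter compute trigger
        raise ValueError(f"unknown framework: {framework}")

    assert evaluation_due("anthropic_rsp", compute_ratio=4.2)
    assert not evaluation_due("deepmind_fsf", compute_ratio=5.0, months_finetuning=2)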

Why Frameworks Are Necessary

Without integrated frameworks, evaluations are aspirational — they generate data without driving action. Governance frameworks gate scaling on evaluation results, making evaluations consequential.

These three are the canonical instances of frontier safety frameworks (FSFs), a structure twelve major companies had adopted as of March 2025.
