Evaluation Design
How to design effective AI evaluations: managing system affordances, scaling evaluations efficiently, and integrating findings into audit pipelines that drive real safety decisions. The AI Safety Atlas (Ch.5.1) covers the operational craft of evaluation work.
Affordances
Affordances = “resources and opportunities we give the AI system” — internet access, code execution, context length, specialized tools, agent scaffolding. Three test conditions:
- Minimal affordance — strip resources to baseline core capabilities (the model alone)
- Typical affordance — replicate normal user environment
- Maximal affordance — provide all relevant tools to reveal full capability
Why it matters: a model without code-execution access might suggest unsafe code it can never actually run, and a model might only demonstrate Minecraft gameplay when given appropriate agent scaffolding. The same model can look very different across affordance conditions.
Quality assurance requires documenting which affordances were available, enforcing the intended restrictions, and ensuring reproducibility; a configuration sketch follows.
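A minimal sketch of how a harness might declare these profiles as explicit configuration; the dataclass fields and preset values are illustrative assumptions, not details from the Atlas:

```python
from dataclasses import dataclass, asdict
import json

@dataclass(frozen=True)
class AffordanceProfile:
    """Declares exactly which resources the evaluated model may use."""
    name: str
    internet_access: bool
    code_execution: bool
    context_length: int
    tools: tuple[str, ...] = ()
    scaffolding: str | None = None

    def manifest(self) -> str:
        """Serialize the profile so each run is documented and reproducible."""
        return json.dumps(asdict(self), sort_keys=True)

# The three test conditions described above; values are placeholders.
MINIMAL = AffordanceProfile("minimal", internet_access=False,
                            code_execution=False, context_length=8_000)
TYPICAL = AffordanceProfile("typical", internet_access=True,
                            code_execution=False, context_length=128_000,
                            tools=("web_search",))
MAXIMAL = AffordanceProfile("maximal", internet_access=True,
                            code_execution=True, context_length=128_000,
                            tools=("web_search", "shell", "browser"),
                            scaffolding="agent-loop")
```

Freezing the profile and serializing a manifest with every run addresses all three QA requirements at once: the affordances are documented, deviations are detectable, and the run can be reproduced.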
Scaling Evaluation
Manual evaluation at scale is prohibitively expensive. Two scaling approaches:
Model-Written Evaluations (MWEs)
One AI model generates evaluation questions; a second model judges their quality and relevance. This yields hundreds of questions in hours rather than the weeks manual writing would take.
Important caveat: MWEs and human-written evaluations don’t measure the same thing. Claude 3 Haiku showed 25% power-seeking on human-written questions but 88% on MWE-written ones. MWEs scored higher on quality metrics but carry potential biases that require careful validation.
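A hedged sketch of the generate-and-judge loop, assuming a generic `complete(prompt) -> str` call in place of any specific model API; the prompts and the 0.8 quality bar are invented for illustration:

```python
# Sketch of a model-written evaluation loop. `complete` stands in for any
# LLM API call (prompt in, text out) and must be wired to a real model.
def complete(prompt: str) -> str:
    raise NotImplementedError("wire up a model API here")

def generate_mwe_questions(trait: str, n: int, quality_bar: float = 0.8) -> list[str]:
    """Generate candidate questions with one model, filter with a judge model."""
    accepted: list[str] = []
    while len(accepted) < n:
        question = complete(
            f"Write one question that tests whether an AI assistant "
            f"exhibits {trait}. Return only the question."
        )
        # A second model grades quality and relevance on a 0-1 scale.
        verdict = complete(
            f"Rate from 0.0 to 1.0 how well this question measures {trait}. "
            f"Answer with only the number.\n\n{question}"
        )
        try:
            if float(verdict.strip()) >= quality_bar:
                accepted.append(question)
        except ValueError:
            continue  # judge returned something unparseable; discard and retry
    return accepted
```

Note the judge filter addresses quality, not distribution shift: as the Haiku numbers above show, judge-approved model-written questions can still measure something different from human-written ones.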
Automated Pipelines
Automated pipelines generate, test, and grade evaluations at scale (a skeleton sketch follows this list). They must:
- Maintain quality
- Enforce affordance conditions reliably
- Validate that evaluations measure intended properties
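A skeleton of such a pipeline, reusing the hypothetical AffordanceProfile from the earlier sketch; run_model and grade are assumed harness hooks, stubbed here rather than real APIs:

```python
# Skeleton pipeline: every item runs under an explicit affordance profile,
# and results carry the profile manifest so runs are reproducible.
def run_model(question: str, profile: AffordanceProfile) -> str:
    ...  # harness executes the model with exactly the declared affordances

def grade(question: str, answer: str) -> float:
    ...  # automated grader returns a score in [0, 1]

def run_eval(questions: list[str], profile: AffordanceProfile) -> dict:
    results = []
    for q in questions:
        answer = run_model(q, profile)  # affordance condition enforced per item
        results.append({"question": q, "answer": answer, "score": grade(q, answer)})
    return {
        "affordances": profile.manifest(),  # document what was available
        "mean_score": sum(r["score"] for r in results) / max(len(results), 1),
        "results": results,
    }
```

Validation that the questions measure the intended property still has to happen upstream of this loop, e.g. via the judge filter plus human spot checks.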
Evaluations vs. Audits
A critical distinction:
- Evaluations measure specific properties (e.g., “Can the model exfiltrate weights?”)
- Audits are systematic safety verification combining multiple evaluations with organizational processes and frameworks
Four audit types (see the sketch after this list):
- Training design audits — compute plans, training data, capability indicators
- Security audits — safeguards under adversarial conditions
- Deployment audits — specific scenarios and safeguards
- Governance audits — organizational safety infrastructure and response protocols
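Purely as a data-structure illustration of how an audit aggregates many evaluations and process checks: the scope items below paraphrase the list above and are not an official checklist.

```python
# Illustrative only: an audit passes when every item in its scope is done.
AUDIT_SCOPES: dict[str, list[str]] = {
    "training_design": ["compute plan review", "training data review",
                        "capability indicator forecast"],
    "security": ["weight exfiltration eval", "adversarial red-team exercise"],
    "deployment": ["scenario-specific evals", "safeguard coverage check"],
    "governance": ["incident response protocol review",
                   "org safety infrastructure review"],
}

def audit_complete(audit_type: str, completed: set[str]) -> bool:
    """True only when every required item for this audit has been completed."""
    return set(AUDIT_SCOPES[audit_type]) <= completed
```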
Integration: Making Evaluations Consequential
Evaluations matter only when they drive decisions. Required structure:
- Predefined thresholds triggering specific actions (pause scaling, halt deployment)
- Pre-committed responses so race-dynamic pressure can’t override findings
- External enforcement — without it, evaluations become safety washing exercises
This is the operational bridge to if-then-commitments and RSPs: capability thresholds detected via evaluation trigger pre-committed safety measures, as in the sketch below.
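A minimal sketch of that pre-commitment structure; the scores and actions are invented placeholders:

```python
# Pre-committed threshold logic: the table is fixed before any evaluation
# runs, so a result mechanically selects a response.
THRESHOLDS = [  # (minimum capability score, pre-committed action)
    (0.9, "halt deployment and notify external auditor"),
    (0.7, "pause further scaling pending security audit"),
    (0.0, "continue with standard monitoring"),
]

def committed_action(capability_score: float) -> str:
    for floor, action in THRESHOLDS:
        if capability_score >= floor:
            return action
    raise ValueError("score below every threshold floor")
```

The point of fixing THRESHOLDS before results exist is that race-dynamic pressure cannot renegotiate the response after a concerning score appears.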
Connection to Wiki
- capability-evaluations — operational use
- evaluated-properties — what’s being measured
- evaluation-techniques — how it’s measured
- evaluation-frameworks — the broader integration
- if-then-commitments — what evaluations trigger
- ai-risk-management — KRI/KCI structure depends on evaluation design
- responsible-scaling-policy, frontier-safety-frameworks — governance frameworks gating on evaluations
Related Pages
- capability-evaluations
- evaluated-properties
- evaluation-techniques
- evaluation-frameworks
- if-then-commitments
- ai-risk-management
- responsible-scaling-policy
- frontier-safety-frameworks
- ai-safety-atlas-textbook
- atlas-ch5-evaluations-01-evaluation-design
Sources cited
Primary URLs harvested from this page’s summary references. Auto-generated by scripts/backfill_citations.py; edit by re-running, not by hand.
- AI Safety Atlas Ch.5 — Evaluation Design — referenced as [[atlas-ch5-evaluations-01-evaluation-design]]