Evaluation Design
How to design effective AI evaluations: managing system affordances, scaling evaluations efficiently, and integrating findings into audit pipelines that drive real safety decisions. The AI Safety Atlas (Ch.5.1) covers the operational craft of evaluation work.
Affordances
Affordances = “resources and opportunities we give the AI system” — internet access, code execution, context length, specialized tools, agent scaffolding. Three test conditions:
- Minimal affordance — strip resources to baseline core capabilities (the model alone)
- Typical affordance — replicate normal user environment
- Maximal affordance — provide all relevant tools to reveal full capability
Why it matters: a model without code-execution access might suggest unsafe code it can never actually run, and a model might only demonstrate Minecraft gameplay when given appropriate agent scaffolding. The same model can look very different across affordance conditions.
Quality assurance requires documenting which affordances were available, enforcing the intended restrictions, and ensuring reproducibility; a configuration sketch follows.
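A minimal sketch of how a harness might declare these profiles as explicit configuration; the dataclass fields and preset values are illustrative assumptions, not details from the Atlas:

```python
from dataclasses import dataclass, asdict
import json

@dataclass(frozen=True)
class AffordanceProfile:
    """Declares exactly which resources the evaluated model may use."""
    name: str
    internet_access: bool
    code_execution: bool
    context_length: int
    tools: tuple[str, ...] = ()
    scaffolding: str | None = None

    def manifest(self) -> str:
        """Serialize the profile so each run is documented and reproducible."""
        return json.dumps(asdict(self), sort_keys=True)

# The three test conditions described above; values are placeholders.
MINIMAL = AffordanceProfile("minimal", internet_access=False,
                            code_execution=False, context_length=8_000)
TYPICAL = AffordanceProfile("typical", internet_access=True,
                            code_execution=False, context_length=128_000,
                            tools=("web_search",))
MAXIMAL = AffordanceProfile("maximal", internet_access=True,
                            code_execution=True, context_length=128_000,
                            tools=("web_search", "shell", "browser"),
                            scaffolding="agent-loop")
```

Freezing the profile and serializing a manifest with every run addresses all three QA requirements at once: the affordances are documented, deviations are detectable, and the run can be reproduced.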
Scaling Evaluation
Manual evaluation at scale is prohibitively expensive. Two scaling approaches:
Model-Written Evaluations (MWEs)
One AI model generates evaluation questions; a second model judges their quality and relevance. This yields hundreds of questions in hours rather than the weeks manual writing would take.
Important caveat: MWEs and human-written evaluations don’t measure the same thing. Claude 3 Haiku showed 25% power-seeking on human-written questions but 88% on MWE-written ones. MWEs scored higher on quality metrics but carry potential biases that require careful validation.
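A hedged sketch of the generate-and-judge loop, assuming a generic `complete(prompt) -> str` call in place of any specific model API; the prompts and the 0.8 quality bar are invented for illustration:

```python
# Sketch of a model-written evaluation loop. `complete` stands in for any
# LLM API call (prompt in, text out) and must be wired to a real model.
def complete(prompt: str) -> str:
    raise NotImplementedError("wire up a model API here")

def generate_mwe_questions(trait: str, n: int, quality_bar: float = 0.8) -> list[str]:
    """Generate candidate questions with one model, filter with a judge model."""
    accepted: list[str] = []
    while len(accepted) < n:
        question = complete(
            f"Write one question that tests whether an AI assistant "
            f"exhibits {trait}. Return only the question."
        )
        # A second model grades quality and relevance on a 0-1 scale.
        verdict = complete(
            f"Rate from 0.0 to 1.0 how well this question measures {trait}. "
            f"Answer with only the number.\n\n{question}"
        )
        try:
            if float(verdict.strip()) >= quality_bar:
                accepted.append(question)
        except ValueError:
            continue  # judge returned something unparseable; discard and retry
    return accepted
```

Note the judge filter addresses quality, not distribution shift: as the Haiku numbers above show, judge-approved model-written questions can still measure something different from human-written ones.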
Automated Pipelines
Automated pipelines generate, test, and grade evaluations at scale (a skeleton sketch follows this list). They must:
- Maintain quality
- Enforce affordance conditions reliably
- Validate that evaluations measure intended properties
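A skeleton of such a pipeline, reusing the hypothetical AffordanceProfile from the earlier sketch; run_model and grade are assumed harness hooks, stubbed here rather than real APIs:

```python
# Skeleton pipeline: every item runs under an explicit affordance profile,
# and results carry the profile manifest so runs are reproducible.
def run_model(question: str, profile: AffordanceProfile) -> str:
    ...  # harness executes the model with exactly the declared affordances

def grade(question: str, answer: str) -> float:
    ...  # automated grader returns a score in [0, 1]

def run_eval(questions: list[str], profile: AffordanceProfile) -> dict:
    results = []
    for q in questions:
        answer = run_model(q, profile)  # affordance condition enforced per item
        results.append({"question": q, "answer": answer, "score": grade(q, answer)})
    return {
        "affordances": profile.manifest(),  # document what was available
        "mean_score": sum(r["score"] for r in results) / max(len(results), 1),
        "results": results,
    }
```

Validation that the questions measure the intended property still has to happen upstream of this loop, e.g. via the judge filter plus human spot checks.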
Evaluations vs. Audits
A critical distinction:
- Evaluations measure specific properties (e.g., “Can the model exfiltrate weights?”)
- Audits are systematic safety verification combining multiple evaluations with organizational processes and frameworks
Four audit types (see the sketch after this list):
- Training design audits — compute plans, training data, capability indicators
- Security audits — safeguards under adversarial conditions
- Deployment audits — specific scenarios and safeguards
- Governance audits — organizational safety infrastructure and response protocols
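Purely as a data-structure illustration of how an audit aggregates many evaluations and process checks: the scope items below paraphrase the list above and are not an official checklist.

```python
# Illustrative only: an audit passes when every item in its scope is done.
AUDIT_SCOPES: dict[str, list[str]] = {
    "training_design": ["compute plan review", "training data review",
                        "capability indicator forecast"],
    "security": ["weight exfiltration eval", "adversarial red-team exercise"],
    "deployment": ["scenario-specific evals", "safeguard coverage check"],
    "governance": ["incident response protocol review",
                   "org safety infrastructure review"],
}

def audit_complete(audit_type: str, completed: set[str]) -> bool:
    """True only when every required item for this audit has been completed."""
    return set(AUDIT_SCOPES[audit_type]) <= completed
```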
Integration: Making Evaluations Consequential
Evaluations matter only when they drive decisions. Required structure:
- Predefined thresholds triggering specific actions (pause scaling, halt deployment)
- Pre-committed responses so race-dynamic pressure can’t override findings
- External enforcement — without it, evaluations become safety washing exercises
This is the operational bridge to if-then-commitments and RSPs: capability thresholds detected via evaluation trigger pre-committed safety measures, as in the sketch below.
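A minimal sketch of that pre-commitment structure; the scores and actions are invented placeholders:

```python
# Pre-committed threshold logic: the table is fixed before any evaluation
# runs, so a result mechanically selects a response.
THRESHOLDS = [  # (minimum capability score, pre-committed action)
    (0.9, "halt deployment and notify external auditor"),
    (0.7, "pause further scaling pending security audit"),
    (0.0, "continue with standard monitoring"),
]

def committed_action(capability_score: float) -> str:
    for floor, action in THRESHOLDS:
        if capability_score >= floor:
            return action
    raise ValueError("score below every threshold floor")
```

The point of fixing THRESHOLDS before results exist is that race-dynamic pressure cannot renegotiate the response after a concerning score appears.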
Connection to Wiki
- capability-evaluations — operational use
- evaluated-properties — what’s being measured
- evaluation-techniques — how it’s measured
- evaluation-frameworks — the broader integration
- if-then-commitments — what evaluations trigger
- ai-risk-management — KRI/KCI structure depends on evaluation design
- responsible-scaling-policy, frontier-safety-frameworks — governance frameworks gating on evaluations
Related Pages
- capability-evaluations
- evaluated-properties
- evaluation-techniques
- evaluation-frameworks
- if-then-commitments
- ai-risk-management
- responsible-scaling-policy
- frontier-safety-frameworks
- ai-safety-atlas-textbook
- atlas-ch5-evaluations-01-evaluation-design
Sources cited
Primary URLs harvested from this page’s summary references. Auto-generated by scripts/backfill_citations.py; edit by re-running, not by hand.
- AI Safety Atlas Ch.5 — Evaluation Design — referenced as [[atlas-ch5-evaluations-01-evaluation-design]]