AI Safety Atlas Ch.5 — Evaluation Design
Source: Evaluation Design
How to design effective AI evaluations: managing system resources, scaling efficiently, and integrating findings into audit pipelines that drive real safety improvements. See evaluation-design.
Affordances
Affordances = “resources and opportunities we give the AI system” — internet access, code execution, context length, specialized tools. Three test conditions:
- Minimal affordance — strip resources to baseline core capabilities
- Typical affordance — replicate normal user environment
- Maximal affordance — provide all relevant tools to reveal full capability
Affordance choice matters: a model might suggest unsafe code without execution access; or demonstrate Minecraft gameplay only with appropriate scaffolding. Quality requires documenting affordances, enforcing restrictions, ensuring reproducibility.
Scaling and Automation
Manual evaluation at scale is prohibitively expensive. Model-Written Evaluations (MWEs) — AI generating evaluation questions, second model judging them.
Important caveat: research shows MWE/human-written discrepancy. Claude 3 Haiku showed 25% power-seeking on human-written questions but 88% on MWE-written ones. MWEs scored higher on quality metrics but exhibit potential bias requiring careful validation.
Automated pipelines generate hundreds of questions in hours vs. weeks manually — but must maintain quality, enforce affordance conditions reliably, validate that evaluations measure intended properties.
Integration and Audits
Evaluations ≠ audits:
- Evaluations measure specific properties
- Audits = systematic safety verification incorporating multiple evaluations + organizational processes
Four audit types:
- Training design — compute plans, training data, capability indicators during development
- Security — safeguards under adversarial conditions
- Deployment — specific deployment scenarios
- Governance — organizational safety infrastructure
For impact, evaluations must connect to decision-making through predefined thresholds — pause scaling, halt deployment. Without enforcement, evaluations become safety washing exercises.
Connection to Wiki
- evaluation-design — dedicated concept page
- capability-evaluations — operational use
- responsible-scaling-policy, frontier-safety-frameworks — governance frameworks that gate on evaluations
- if-then-commitments — the audit-trigger pattern
- ai-risk-management — KRI/KCI framework using evaluation thresholds
- atlas-ch5-evaluations-02-evaluation-techniques