AI Safety Atlas Ch.5 — Evaluation Design

Source: Evaluation Design

How to design effective AI evaluations: managing the affordances given to the system under test, scaling through automation, and integrating findings into audit pipelines that drive real safety improvements. See evaluation-design.

Affordances

Affordances = “resources and opportunities we give the AI system” — internet access, code execution, context length, specialized tools. Three test conditions:

  • Minimal affordance — strip resources to a bare baseline to measure core capabilities
  • Typical affordance — replicate normal user environment
  • Maximal affordance — provide all relevant tools to reveal full capability

Affordance choice matters: a model might only suggest unsafe code when it lacks execution access, or demonstrate Minecraft gameplay only when given appropriate scaffolding. Quality evaluation requires documenting which affordances were granted, enforcing the restrictions consistently, and ensuring reproducibility.
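A minimal sketch of how a harness might record affordance conditions as explicit configuration; the `AffordanceProfile` name, tool lists, and context sizes are illustrative assumptions, not values from the chapter:

```python
from dataclasses import dataclass, asdict
import json

@dataclass(frozen=True)
class AffordanceProfile:
    """Explicit record of the resources granted to the model under test."""
    name: str
    internet_access: bool
    code_execution: bool
    max_context_tokens: int
    tools: tuple[str, ...] = ()

# Hypothetical profiles for the three test conditions above.
MINIMAL = AffordanceProfile("minimal", False, False, 4_096)
TYPICAL = AffordanceProfile("typical", True, False, 32_768, ("web_search",))
MAXIMAL = AffordanceProfile("maximal", True, True, 128_000,
                            ("web_search", "code_sandbox", "file_io"))

def log_affordances(profile: AffordanceProfile) -> str:
    """Serialize the profile so each run documents exactly what was granted."""
    return json.dumps(asdict(profile), sort_keys=True)
```

Enforcement then amounts to exposing only the tools named in the active profile, and the logged profile travels with the results so the run can be reproduced.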

Scaling and Automation

Manual evaluation at scale is prohibitively expensive. Model-Written Evaluations (MWEs) address this: one AI model generates evaluation questions, and a second model judges their quality.
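A rough sketch of the two-model pattern, assuming a generic `complete(model, prompt)` chat-completion helper; the model names, prompts, and scoring scale are placeholders, not the chapter's implementation:

```python
def complete(model: str, prompt: str) -> str:
    """Placeholder for a chat-completion call against the named model."""
    raise NotImplementedError

def generate_questions(property_name: str, n: int,
                       generator: str = "generator-model") -> list[str]:
    # First model drafts candidate evaluation questions for the target property.
    prompt = (f"Write {n} yes/no questions that test whether an AI assistant "
              f"exhibits {property_name}. One question per line.")
    return [q for q in complete(generator, prompt).splitlines() if q.strip()]

def judge_question(question: str, judge: str = "judge-model") -> float:
    # Second model rates each candidate for relevance and clarity on a 0-10 scale.
    prompt = ("Rate from 0 to 10 how well this question measures the intended "
              f"property. Reply with a single number.\n\n{question}")
    return float(complete(judge, prompt).strip())
```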

Important caveat: research shows a discrepancy between model-written and human-written questions. Claude 3 Haiku showed 25% power-seeking on human-written questions but 88% on model-written ones. MWEs scored higher on quality metrics, yet the gap points to potential bias that requires careful validation.

Automated pipelines can generate hundreds of questions in hours rather than the weeks manual writing takes, but they must maintain quality, enforce affordance conditions reliably, and validate that the evaluations measure the intended properties.
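Putting the pieces together, a pipeline might look roughly like the sketch below, reusing the helpers above; the quality threshold is an illustrative choice, and a real pipeline would add human spot-checks and label-balance statistics as its validation step:

```python
def build_eval_set(property_name: str, n_candidates: int,
                   profile: AffordanceProfile, min_score: float = 7.0) -> dict:
    """Generate candidates at scale, keep only highly rated ones, and attach
    the affordance profile under which the evaluation will be administered."""
    candidates = generate_questions(property_name, n_candidates)
    kept = [q for q in candidates if judge_question(q) >= min_score]
    return {
        "property": property_name,
        "questions": kept,
        "affordances": log_affordances(profile),  # enforced at administration time
    }
```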

Integration and Audits

Evaluations ≠ audits:

  • Evaluations measure specific properties
  • Audits = systematic safety verification incorporating multiple evaluations + organizational processes

Four audit types:

  1. Training design — compute plans, training data, capability indicators during development
  2. Security — safeguards under adversarial conditions
  3. Deployment — specific deployment scenarios
  4. Governance — organizational safety infrastructure

To have impact, evaluations must connect to decision-making through predefined thresholds that trigger actions such as pausing scaling or halting deployment. Without enforcement, evaluations become safety-washing exercises.
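One way to make that connection concrete is a decision table committed to before results exist; the thresholds and action names below are illustrative assumptions, not published commitments:

```python
from enum import Enum

class Action(Enum):
    PROCEED = "proceed"
    PAUSE_SCALING = "pause_scaling"
    HALT_DEPLOYMENT = "halt_deployment"

# Hypothetical pre-committed thresholds on a dangerous-capability score in [0, 1].
THRESHOLDS = [
    (0.50, Action.HALT_DEPLOYMENT),
    (0.20, Action.PAUSE_SCALING),
]

def decide(dangerous_capability_score: float) -> Action:
    """Apply the pre-committed policy so results cannot be reinterpreted post hoc."""
    for threshold, action in THRESHOLDS:
        if dangerous_capability_score >= threshold:
            return action
    return Action.PROCEED
```

Under these example thresholds, `decide(0.33)` returns `Action.PAUSE_SCALING`; the point is that the mapping from scores to actions is fixed before anyone sees the results.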

Connection to Wiki