AI Safety Atlas Ch.5 — Evaluation Design

Source: Evaluation Design

How to design effective AI evaluations: managing the affordances given to the system under test, scaling through automation, and integrating findings into audit pipelines that drive real safety improvements. See evaluation-design.

Affordances

Affordances = “resources and opportunities we give the AI system” — internet access, code execution, context length, specialized tools. Three test conditions:

  • Minimal affordance — strip resources to a bare baseline to measure core capabilities
  • Typical affordance — replicate normal user environment
  • Maximal affordance — provide all relevant tools to reveal full capability

Affordance choice matters: a model might only suggest unsafe code when it lacks execution access, or demonstrate Minecraft gameplay only when given appropriate scaffolding. Quality evaluation requires documenting which affordances were granted, enforcing the restrictions consistently, and ensuring reproducibility.
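A minimal sketch of how a harness might record affordance conditions as explicit configuration; the `AffordanceProfile` name, tool lists, and context sizes are illustrative assumptions, not values from the chapter:

```python
from dataclasses import dataclass, asdict
import json

@dataclass(frozen=True)
class AffordanceProfile:
    """Explicit record of the resources granted to the model under test."""
    name: str
    internet_access: bool
    code_execution: bool
    max_context_tokens: int
    tools: tuple[str, ...] = ()

# Hypothetical profiles for the three test conditions above.
MINIMAL = AffordanceProfile("minimal", False, False, 4_096)
TYPICAL = AffordanceProfile("typical", True, False, 32_768, ("web_search",))
MAXIMAL = AffordanceProfile("maximal", True, True, 128_000,
                            ("web_search", "code_sandbox", "file_io"))

def log_affordances(profile: AffordanceProfile) -> str:
    """Serialize the profile so each run documents exactly what was granted."""
    return json.dumps(asdict(profile), sort_keys=True)
```

Enforcement then amounts to exposing only the tools named in the active profile, and the logged profile travels with the results so the run can be reproduced.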

Scaling and Automation

Manual evaluation at scale is prohibitively expensive. Model-Written Evaluations (MWEs) address this: one AI model generates evaluation questions, and a second model judges their quality.
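A rough sketch of the two-model pattern, assuming a generic `complete(model, prompt)` chat-completion helper; the model names, prompts, and scoring scale are placeholders, not the chapter's implementation:

```python
def complete(model: str, prompt: str) -> str:
    """Placeholder for a chat-completion call against the named model."""
    raise NotImplementedError

def generate_questions(property_name: str, n: int,
                       generator: str = "generator-model") -> list[str]:
    # First model drafts candidate evaluation questions for the target property.
    prompt = (f"Write {n} yes/no questions that test whether an AI assistant "
              f"exhibits {property_name}. One question per line.")
    return [q for q in complete(generator, prompt).splitlines() if q.strip()]

def judge_question(question: str, judge: str = "judge-model") -> float:
    # Second model rates each candidate for relevance and clarity on a 0-10 scale.
    prompt = ("Rate from 0 to 10 how well this question measures the intended "
              f"property. Reply with a single number.\n\n{question}")
    return float(complete(judge, prompt).strip())
```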

Important caveat: research shows a discrepancy between model-written and human-written questions. Claude 3 Haiku showed 25% power-seeking on human-written questions but 88% on model-written ones. MWEs scored higher on quality metrics, yet the gap points to potential bias that requires careful validation.

Automated pipelines can generate hundreds of questions in hours rather than the weeks manual writing takes, but they must maintain quality, enforce affordance conditions reliably, and validate that the evaluations measure the intended properties.
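Putting the pieces together, a pipeline might look roughly like the sketch below, reusing the helpers above; the quality threshold is an illustrative choice, and a real pipeline would add human spot-checks and label-balance statistics as its validation step:

```python
def build_eval_set(property_name: str, n_candidates: int,
                   profile: AffordanceProfile, min_score: float = 7.0) -> dict:
    """Generate candidates at scale, keep only highly rated ones, and attach
    the affordance profile under which the evaluation will be administered."""
    candidates = generate_questions(property_name, n_candidates)
    kept = [q for q in candidates if judge_question(q) >= min_score]
    return {
        "property": property_name,
        "questions": kept,
        "affordances": log_affordances(profile),  # enforced at administration time
    }
```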

Integration and Audits

Evaluations ≠ audits:

  • Evaluations measure specific properties
  • Audits = systematic safety verification incorporating multiple evaluations + organizational processes

Four audit types:

  1. Training design — compute plans, training data, capability indicators during development
  2. Security — safeguards under adversarial conditions
  3. Deployment — specific deployment scenarios
  4. Governance — organizational safety infrastructure

To have impact, evaluations must connect to decision-making through predefined thresholds that trigger actions such as pausing scaling or halting deployment. Without enforcement, evaluations become safety-washing exercises.
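One way to make that connection concrete is a decision table committed to before results exist; the thresholds and action names below are illustrative assumptions, not published commitments:

```python
from enum import Enum

class Action(Enum):
    PROCEED = "proceed"
    PAUSE_SCALING = "pause_scaling"
    HALT_DEPLOYMENT = "halt_deployment"

# Hypothetical pre-committed thresholds on a dangerous-capability score in [0, 1].
THRESHOLDS = [
    (0.50, Action.HALT_DEPLOYMENT),
    (0.20, Action.PAUSE_SCALING),
]

def decide(dangerous_capability_score: float) -> Action:
    """Apply the pre-committed policy so results cannot be reinterpreted post hoc."""
    for threshold, action in THRESHOLDS:
        if dangerous_capability_score >= threshold:
            return action
    return Action.PROCEED
```

Under these example thresholds, `decide(0.33)` returns `Action.PAUSE_SCALING`; the point is that the mapping from scores to actions is fixed before anyone sees the results.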

Connection to Wiki