AI Safety Atlas Ch.5 — Introduction
Source: Evaluations — Introduction
“Measuring AI capabilities and safety properties is fundamental to managing AI risk, yet current evaluation methods struggle to keep pace with rapid model advancement.”
The Evaluation Gap Problem
Benchmarks become obsolete almost as soon as they are released. Concrete data point: FrontierMath, released in late 2024 with problems expected to resist solution for years, was largely solved by OpenAI’s o3 within months. “Tools designed to measure AI capabilities become obsolete almost immediately as models rapidly surpass them.”
Why Benchmarks Alone Aren’t Sufficient
Standard benchmarks (MMLU, GPQA) revolutionized AI development but fall short for safety assessment. They don’t capture:
- Real-world deployment complexity
- Edge-case behavior
- Capability combinations not explicitly tested
“A model passing most safety benchmarks might still harbor dangerous capability combinations that weren’t explicitly tested.”
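The gap the quote points at can be made concrete with a toy example (entirely hypothetical data, not from the chapter): two models with the same headline benchmark score can behave very differently on a safety-relevant edge-case subset, which a single aggregate number never reveals.

```python
from dataclasses import dataclass

@dataclass
class Item:
    correct: bool
    edge_case: bool  # flags a safety-relevant edge case (our assumption)

def accuracy(items):
    return sum(i.correct for i in items) / len(items)

def edge_accuracy(items):
    return accuracy([i for i in items if i.edge_case])

# Model A's errors are scattered; Model B's errors all land on edge cases.
model_a = [Item(correct=(k % 10 != 0), edge_case=(k < 10)) for k in range(100)]
model_b = [Item(correct=(k >= 10), edge_case=(k < 10)) for k in range(100)]

assert accuracy(model_a) == accuracy(model_b) == 0.9  # identical headline score
print(edge_accuracy(model_a), edge_accuracy(model_b))  # 0.9 0.0
```

The aggregate score (0.9 for both) is the only thing a leaderboard reports, which is exactly why a benchmark pass is weak evidence about untested behavior.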
Three-Part Safety Assessment Framework
The chapter’s central organizing structure — see evaluated-properties:
- Dangerous Capabilities — upper bounds on what models can do
- Propensities — what models tend to do by default
- Control — whether safety protocols withstand active circumvention
Two Technique Families
- Behavioral — red teaming, supervised fine-tuning, best-of-N sampling
- Internal — sparse autoencoders, mechanistic interpretability, representation engineering
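As a sketch of one behavioral technique, best-of-N sampling elicits an upper bound on capability by drawing many samples and keeping the best one. `sample_model` and `score` below are placeholders for a real model call and a task grader (assumptions for illustration, not APIs from the chapter).

```python
import random

def sample_model(prompt, rng):
    # Stand-in for an LLM call: a candidate answer with noisy quality.
    return {"text": f"candidate answer to {prompt!r}", "quality": rng.random()}

def score(candidate):
    # Stand-in for a task-specific grader (e.g. unit tests or an answer key).
    return candidate["quality"]

def best_of_n(prompt, n=16, seed=0):
    rng = random.Random(seed)
    candidates = [sample_model(prompt, rng) for _ in range(n)]
    return max(candidates, key=score)

# With a fixed seed, the n=64 pool contains the n=1 sample, so the elicited
# score can only go up: best-of-N estimates what the model *can* do
# (a dangerous-capability upper bound), not what it typically does.
print(best_of_n("solve task", n=1)["quality"] <= best_of_n("solve task", n=64)["quality"])  # True
```

This is why the same sampling trick serves capability evaluation (find the ceiling) rather than propensity evaluation (characterize the default).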
Four Critical Limitations
- Asymmetric proof — can confirm capabilities exist, can’t prove absence
- Measurement sensitivity — trivial prompt changes alter performance
- Sandbagging — models may strategically underperform
- Safety washing — organizational misrepresentation
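The measurement-sensitivity limitation can be illustrated with a toy harness (hypothetical model and templates, chosen only to show the failure mode): the same item is scored under trivially different phrasings, and a format-brittle model's measured score swings between 0 and 1.

```python
def run_eval(model, items, template):
    # Score a model on (question, answer) pairs under one prompt template.
    return sum(model(template.format(q=q)) == a for q, a in items) / len(items)

templates = [
    "Q: {q}\nA:",
    "Question: {q}\nAnswer:",
    "{q}",  # same content, different surface formatting
]

def brittle_model(prompt):
    # Toy model that only answers correctly under one formatting (assumption).
    return "4" if prompt.startswith("Q:") else "5"

items = [("What is 2 + 2?", "4")]
scores = [run_eval(brittle_model, items, t) for t in templates]
print(scores)  # [1.0, 0.0, 0.0]: trivial rewording flips the measurement
```

A robust evaluation would report the spread across templates rather than a single-template score, since any one number here is an artifact of formatting.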
Connection to Wiki
This chapter is the technical implementation backbone for everything Ch.3 outlined and Ch.4 governs. It maps directly to:
- capability-evaluations — substantially deepened
- capability-evals, ai-deception-evals, ai-scheming-evals, autonomy-evals, sandbagging-evals, situational-awareness-and-self-awareness-evals, wmd-evals-weapons-of-mass-destruction — SR2025 evaluation agendas
- responsible-scaling-policy, frontier-safety-frameworks — governance frameworks built on evaluations
- if-then-commitments — operationalized through evaluations