AI Safety Atlas Ch.5 — Introduction
Source: Evaluations — Introduction
“Measuring AI capabilities and safety properties is fundamental to managing AI risk, yet current evaluation methods struggle to keep pace with rapid model advancement.”
The Evaluation Gap Problem
Benchmarks become obsolete almost as soon as they are released. Concrete data point: FrontierMath, released in late 2024 with problems expected to resist solution for years, was largely solved by OpenAI’s o3 within months. “Tools designed to measure AI capabilities become obsolete almost immediately as models rapidly surpass them.”
Why Benchmarks Alone Aren’t Sufficient
Standard benchmarks (MMLU, GPQA) revolutionized AI development but fall short for safety assessment. They don’t capture:
- Real-world deployment complexity
- Edge-case behavior
- Capability combinations not explicitly tested
“A model passing most safety benchmarks might still harbor dangerous capability combinations that weren’t explicitly tested.”
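The gap the quote points at can be made concrete with a toy example (entirely hypothetical data, not from the chapter): two models with the same headline benchmark score can behave very differently on a safety-relevant edge-case subset, which a single aggregate number never reveals.

```python
from dataclasses import dataclass

@dataclass
class Item:
    correct: bool
    edge_case: bool  # flags a safety-relevant edge case (our assumption)

def accuracy(items):
    return sum(i.correct for i in items) / len(items)

def edge_accuracy(items):
    return accuracy([i for i in items if i.edge_case])

# Model A's errors are scattered; Model B's errors all land on edge cases.
model_a = [Item(correct=(k % 10 != 0), edge_case=(k < 10)) for k in range(100)]
model_b = [Item(correct=(k >= 10), edge_case=(k < 10)) for k in range(100)]

assert accuracy(model_a) == accuracy(model_b) == 0.9  # identical headline score
print(edge_accuracy(model_a), edge_accuracy(model_b))  # 0.9 0.0
```

The aggregate score (0.9 for both) is the only thing a leaderboard reports, which is exactly why a benchmark pass is weak evidence about untested behavior.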
Three-Part Safety Assessment Framework
The chapter’s central organizing structure — see evaluated-properties:
- Dangerous Capabilities — upper bounds on what models can do
- Propensities — what models tend to do by default
- Control — whether safety protocols withstand active circumvention
Two Technique Families
- Behavioral — red teaming, supervised fine-tuning, best-of-N sampling
- Internal — sparse autoencoders, mechanistic interpretability, representation engineering
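As a sketch of one behavioral technique, best-of-N sampling elicits an upper bound on capability by drawing many samples and keeping the best one. `sample_model` and `score` below are placeholders for a real model call and a task grader (assumptions for illustration, not APIs from the chapter).

```python
import random

def sample_model(prompt, rng):
    # Stand-in for an LLM call: a candidate answer with noisy quality.
    return {"text": f"candidate answer to {prompt!r}", "quality": rng.random()}

def score(candidate):
    # Stand-in for a task-specific grader (e.g. unit tests or an answer key).
    return candidate["quality"]

def best_of_n(prompt, n=16, seed=0):
    rng = random.Random(seed)
    candidates = [sample_model(prompt, rng) for _ in range(n)]
    return max(candidates, key=score)

# With a fixed seed, the n=64 pool contains the n=1 sample, so the elicited
# score can only go up: best-of-N estimates what the model *can* do
# (a dangerous-capability upper bound), not what it typically does.
print(best_of_n("solve task", n=1)["quality"] <= best_of_n("solve task", n=64)["quality"])  # True
```

This is why the same sampling trick serves capability evaluation (find the ceiling) rather than propensity evaluation (characterize the default).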
Four Critical Limitations
- Asymmetric proof — can confirm capabilities exist, can’t prove absence
- Measurement sensitivity — trivial prompt changes alter performance
- Sandbagging — models may strategically underperform
- Safety washing — organizational misrepresentation
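The measurement-sensitivity limitation can be illustrated with a toy harness (hypothetical model and templates, chosen only to show the failure mode): the same item is scored under trivially different phrasings, and a format-brittle model's measured score swings between 0 and 1.

```python
def run_eval(model, items, template):
    # Score a model on (question, answer) pairs under one prompt template.
    return sum(model(template.format(q=q)) == a for q, a in items) / len(items)

templates = [
    "Q: {q}\nA:",
    "Question: {q}\nAnswer:",
    "{q}",  # same content, different surface formatting
]

def brittle_model(prompt):
    # Toy model that only answers correctly under one formatting (assumption).
    return "4" if prompt.startswith("Q:") else "5"

items = [("What is 2 + 2?", "4")]
scores = [run_eval(brittle_model, items, t) for t in templates]
print(scores)  # [1.0, 0.0, 0.0]: trivial rewording flips the measurement
```

A robust evaluation would report the spread across templates rather than a single-template score, since any one number here is an artifact of formatting.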
Connection to Wiki
This chapter is the technical implementation backbone for everything Ch.3 outlined and Ch.4 governs. It maps directly to:
- capability-evaluations — substantially deepened
- capability-evals, ai-deception-evals, ai-scheming-evals, autonomy-evals, sandbagging-evals, situational-awareness-and-self-awareness-evals, wmd-evals-weapons-of-mass-destruction — SR2025 evaluation agendas
- responsible-scaling-policy, frontier-safety-frameworks — governance frameworks built on evaluations
- if-then-commitments — operationalized through evaluations