AI Safety Atlas Ch.5 — Introduction

Source: Evaluations — Introduction

“Measuring AI capabilities and safety properties is fundamental to managing AI risk, yet current evaluation methods struggle to keep pace with rapid model advancement.”

The Evaluation Gap Problem

Benchmarks designed to measure AI capabilities become obsolete almost immediately. Concrete data point: FrontierMath, created in late 2024 with problems intended to resist solution for years, was largely solved by OpenAI’s o3 within months. “Tools designed to measure AI capabilities become obsolete almost immediately as models rapidly surpass them.”

Why Benchmarks Alone Aren’t Sufficient

Standard benchmarks (MMLU, GPQA) revolutionized AI development but fall short for safety assessment. They don’t capture:

  • Real-world deployment complexity
  • Edge-case behavior
  • Capability combinations not explicitly tested

“A model passing most safety benchmarks might still harbor dangerous capability combinations that weren’t explicitly tested.”

Three-Part Safety Assessment Framework

The chapter’s central organizing structure — see evaluated-properties:

  1. Dangerous Capabilities — upper bounds on what models can do
  2. Propensities — what models tend to do by default
  3. Control — whether safety protocols withstand active circumvention

Two Technique Families

See evaluation-techniques:

  • Behavioral (black-box) — red teaming, supervised fine-tuning, best-of-N sampling (see the sketch after this list)
  • Internal (white-box) — sparse autoencoders, mechanistic interpretability, representation engineering
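
To make the behavioral family concrete, here is a minimal best-of-N sketch in Python. It is an illustration under stated assumptions, not the chapter’s implementation: sample (an inference call) and score (a task grader) are hypothetical callables.

    import math

    def best_of_n(sample, score, prompt, n=16):
        """Best-of-N elicitation: draw n completions, keep the best.

        Probes a capability ceiling rather than average behavior --
        a task the model fails on a typical sample may still be
        solved by one of n samples, revealing latent capability.

        sample(prompt) -> completion string    (hypothetical inference API)
        score(prompt, completion) -> float     (hypothetical task grader)
        """
        best, best_score = None, -math.inf
        for _ in range(n):
            completion = sample(prompt)
            s = score(prompt, completion)
            if s > best_score:
                best, best_score = completion, s
        return best, best_score

Raising n trades compute for a tighter lower bound on what the model can do, matching the “upper bounds” framing of the dangerous-capabilities bucket above.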

Four Critical Limitations

See evaluation-limitations:

  • Asymmetric proof — evaluations can confirm a capability exists but cannot prove its absence
  • Measurement sensitivity — trivial prompt changes alter measured performance (see the sketch after this list)
  • Sandbagging — models may strategically underperform during evaluation
  • Safety washing — organizational misrepresentation of safety progress
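
A minimal sketch of how measurement sensitivity can be probed, as referenced in the list above. All names here are hypothetical (model as an inference callable, check as a grader, hand-written templates); the chapter does not specify this harness.

    from statistics import mean

    # Semantically equivalent prompt templates; in practice such
    # trivial formatting changes can shift measured scores.
    TEMPLATES = {
        "plain":  "{q}",
        "qa":     "Question: {q}\nAnswer:",
        "polite": "Please answer the following question.\n\n{q}",
    }

    def accuracy_by_template(model, check, dataset, templates=TEMPLATES):
        """dataset: list of (question, gold_answer) pairs.
        model(prompt) -> completion      (hypothetical inference API)
        check(completion, gold) -> bool  (hypothetical grader)

        Returns per-template accuracy; a wide spread across these
        equivalent framings is the measurement-sensitivity failure
        mode listed above.
        """
        return {
            name: mean(check(model(t.format(q=q)), gold)
                       for q, gold in dataset)
            for name, t in templates.items()
        }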

Connection to Wiki

This chapter is the technical implementation backbone for everything Ch.3 outlined and Ch.4 governs. Maps directly to: