AI Safety Atlas Ch.5 — Benchmarks

Source: Benchmarks

A benchmark = a standardized measurement tool, comparable to a standardized test, that lets researchers measure and compare AI capabilities. Distinct from comprehensive evaluations, to which benchmarks are only one input.
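
A minimal sketch of what this means operationally: a benchmark is a fixed item set plus a fixed scoring rule, producing one comparable number. `model_answer`, the items, and the scoring below are hypothetical placeholders, not any real benchmark's harness.

```python
# Minimal sketch of a benchmark as a standardized measurement tool.
# `model_answer` is a hypothetical stand-in for the system under test.

ITEMS = [
    {"question": "2 + 2 = ?", "answer": "4"},
    {"question": "Capital of France?", "answer": "Paris"},
]

def model_answer(question: str) -> str:
    # Placeholder: in practice this would call the model being evaluated.
    return "4" if "2 + 2" in question else "Paris"

def run_benchmark(items) -> float:
    # Same items, same scoring rule for every model -> comparable scores.
    correct = sum(model_answer(it["question"]).strip() == it["answer"] for it in items)
    return correct / len(items)

print(f"accuracy: {run_benchmark(ITEMS):.2%}")
```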

How Benchmarks Shape Development

Benchmarks function as both measurement instruments and guides for research direction. When creators establish what to measure, they simultaneously influence what researchers prioritize.

“Goal definitions and evaluation benchmarks are among the most potent drivers of scientific progress.” — François Chollet

Historical Evolution

Computer vision progression:

  • MNIST (1998) — 70K handwritten digits
  • CIFAR-10/100 (2009) — natural color images
  • ImageNet (2009) — 1.2M images / 1000 categories

Language models progression:

  • Language Understanding — GLUE, SuperGLUE, SWAG, HellaSwag
  • Broad Knowledge — MMLU (57 subjects), GPQA (web-search resistant), BIG-bench
  • Math — GSM8K (elementary), MATH (competition)
  • Coding — APPS, HumanEval, HumanEval-XL (23 natural languages, 12 programming languages)
  • Ethics/Truth — ETHICS benchmark, TruthfulQA
  • Safety — AgentHarm (malicious requests), MACHIAVELLI (game ethics)

Critical Limitations

Training Data Contamination

Popular benchmarks (MMLU, TruthfulQA) are widely discussed online. When this material enters training data, high scores can reflect memorization rather than understanding.
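
One rough way contamination gets probed (a sketch, not any benchmark's official decontamination pipeline) is n-gram overlap between benchmark items and training text; the corpus and item below are made up.

```python
# Rough contamination probe: flag benchmark items whose n-grams
# already appear verbatim in the training corpus (illustrative data only).

def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

training_corpus = "the mitochondria is the powerhouse of the cell and ..."
benchmark_item = "Which organelle is the powerhouse of the cell?"

corpus_grams = ngrams(training_corpus, n=5)
item_grams = ngrams(benchmark_item, n=5)

overlap = len(item_grams & corpus_grams) / max(len(item_grams), 1)
print(f"n-gram overlap: {overlap:.0%}  (high overlap -> possible contamination)")
```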

Understanding vs. Memorization

GPT-4 solves Caesar cipher with common shifts (3, 5) but fails with uncommon values (67) — pattern recognition, not algorithm mastery.
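
For reference, the general algorithm is shift-agnostic: a shift of 67 is just 67 mod 26 = 15, so a system that had mastered the algorithm rather than memorized common cases would handle it identically. A minimal sketch:

```python
# General Caesar decryption works for any shift value, common or not:
# a shift of 67 reduces to 67 % 26 == 15, an ordinary shift.
def caesar_decrypt(ciphertext: str, shift: int) -> str:
    out = []
    for ch in ciphertext:
        if ch.isalpha():
            base = ord("A") if ch.isupper() else ord("a")
            out.append(chr((ord(ch) - base - shift) % 26 + base))
        else:
            out.append(ch)
    return "".join(out)

# "hello world" encrypted with shift 67 (i.e., 15) is "wtaad ldgas".
print(caesar_decrypt("wtaad ldgas", shift=67))  # -> hello world
```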

Infinite Task Space

Possible situations are effectively infinite; even millions of training examples cover ~0% of the possibilities. Logical relationships (e.g., transitivity, inverse relations) don't emerge naturally from standard training.
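
A hedged sketch of how such gaps can be probed: construct held-out items whose answers are never stated but follow logically (here, by transitivity) from given facts, and check the model's answers against the implied relation. `model_answer` and the facts are hypothetical placeholders.

```python
# Sketch of a transitivity probe: the correct answer is never stated directly;
# it only follows logically from chaining the given facts.
# `model_answer` is a hypothetical stand-in for the system under test.

def model_answer(prompt: str) -> str:
    return "yes"  # placeholder

facts = [("Ada", "Bo"), ("Bo", "Cy"), ("Cy", "Dee")]  # "X is taller than Y"

def transitive_pairs(facts):
    # Derive (X, Z) pairs implied by chaining two "taller than" facts.
    taller = dict(facts)
    for x, y in facts:
        if y in taller:
            yield x, taller[y]

for x, z in transitive_pairs(facts):
    prompt = ("Facts: " + "; ".join(f"{a} is taller than {b}" for a, b in facts)
              + f". Is {x} taller than {z}? Answer yes or no.")
    print(prompt, "->", model_answer(prompt), "(expected: yes)")
```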

Performance vs. Capability

Benchmarks measure raw scores without systematically assessing whether systems possess the underlying capabilities; they cannot distinguish genuine understanding from spurious correlations.
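
One hedged way to narrow this gap is consistency checking: re-score the same items under meaning-preserving rewordings and see whether the score survives. The items and `model_answer` below are hypothetical placeholders, a sketch rather than an established protocol.

```python
# Consistency probe sketch: a capability claim is stronger if the score
# survives meaning-preserving rewordings of the same items.
# `model_answer` is a hypothetical stand-in for the system under test.

items = [
    ("What is the boiling point of water at sea level, in Celsius?", "100"),
    ("At sea level, water boils at how many degrees Celsius?", "100"),  # paraphrase
]

def model_answer(question: str) -> str:
    return "100"  # placeholder

answers = [model_answer(q) for q, _ in items]
consistent = len(set(answers)) == 1
correct = all(a == gold for a, (_, gold) in zip(answers, items))
print(f"consistent across paraphrases: {consistent}, correct: {correct}")
```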

Emerging Approaches

  • FrontierMath — entirely unpublished problems crafted by expert mathematicians. Initial models scored <2%; o3 reached 25.2% within months, illustrating how quickly even very hard benchmarks saturate.
  • Compute-aware benchmarking — 2024 ARC Prize: systems limited to ~$10K of compute achieved similar accuracy, showing better algorithms can compensate for less compute. Benchmarks must specify compute budgets (see the sketch after this list).
  • Humanity’s Last Exam (HLE) — deliberately very difficult questions gathered from experts across all fields
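
For the compute-aware point above, a minimal sketch (all numbers made up) of what compute-aware reporting looks like: a score is only interpretable next to the compute budget it was achieved under.

```python
# Compute-aware reporting sketch: a score is only comparable when paired
# with the compute budget it was achieved under (all numbers are made up).
results = [
    {"system": "A", "accuracy": 0.53, "compute_usd": 10_000},
    {"system": "B", "accuracy": 0.55, "compute_usd": 300_000},
]

for r in results:
    # Report accuracy alongside cost, plus a crude accuracy-per-$1K figure.
    eff = r["accuracy"] / (r["compute_usd"] / 1_000)
    print(f"{r['system']}: acc={r['accuracy']:.0%} @ ${r['compute_usd']:,} "
          f"(acc per $1K compute: {eff:.4f})")
```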

Beyond Benchmarking: Evaluations

The Atlas’s key distinction:

  • Benchmarks measure what’s easily quantifiable
  • Evaluations work backward from concrete threat models, asking “What could go wrong?” and then systematically testing for those failure modes

“An evaluation represents a complete safety assessment protocol incorporating benchmarks alongside broader analyses and elicitation methods.”

Organizations like METR evaluate scenarios (e.g., models exploiting security vulnerabilities to gain compute and evade detection) — going far beyond what benchmarks capture.

Connection to Wiki

This subchapter clarifies the benchmarks ⊂ evaluations relationship. Benchmarks are useful for capability tracking but insufficient for safety certification: