AI Safety Atlas Ch.5 — Benchmarks

Source: Benchmarks

A benchmark = a standardized measurement tool, comparable to a standardized test, that lets researchers measure and compare AI capabilities. Distinct from comprehensive evaluations, to which benchmarks are only one input.
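
A minimal sketch of what this means operationally: a benchmark is a fixed item set plus a fixed scoring rule, producing one comparable number. `model_answer`, the items, and the scoring below are hypothetical placeholders, not any real benchmark's harness.

```python
# Minimal sketch of a benchmark as a standardized measurement tool.
# `model_answer` is a hypothetical stand-in for the system under test.

ITEMS = [
    {"question": "2 + 2 = ?", "answer": "4"},
    {"question": "Capital of France?", "answer": "Paris"},
]

def model_answer(question: str) -> str:
    # Placeholder: in practice this would call the model being evaluated.
    return "4" if "2 + 2" in question else "Paris"

def run_benchmark(items) -> float:
    # Same items, same scoring rule for every model -> comparable scores.
    correct = sum(model_answer(it["question"]).strip() == it["answer"] for it in items)
    return correct / len(items)

print(f"accuracy: {run_benchmark(ITEMS):.2%}")
```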

How Benchmarks Shape Development

Benchmarks function as both measurement instruments and guides for research direction. When creators establish what to measure, they simultaneously influence what researchers prioritize.

“Goal definitions and evaluation benchmarks are among the most potent drivers of scientific progress.” — François Chollet

Historical Evolution

Computer vision progression:

  • MNIST (1998) — 70K handwritten digits
  • CIFAR-10/100 (2009) — natural color images
  • ImageNet (2009) — 1.2M images / 1000 categories

Language models progression:

  • Language Understanding — GLUE, SuperGLUE, SWAG, HellaSwag
  • Broad Knowledge — MMLU (57 subjects), GPQA (web-search resistant), BIG-bench
  • Math — GSM8K (elementary), MATH (competition)
  • Coding — APPS, HumanEval, HumanEval-XL (23 natural languages, 12 programming languages)
  • Ethics/Truth — ETHICS benchmark, TruthfulQA
  • Safety — AgentHarm (malicious requests), MACHIAVELLI (game ethics)

Critical Limitations

Training Data Contamination

Popular benchmarks (MMLU, TruthfulQA) are widely discussed online. When this material enters training data, high scores can reflect memorization rather than understanding.
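
One rough way contamination gets probed (a sketch, not any benchmark's official decontamination pipeline) is n-gram overlap between benchmark items and training text; the corpus and item below are made up.

```python
# Rough contamination probe: flag benchmark items whose n-grams
# already appear verbatim in the training corpus (illustrative data only).

def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

training_corpus = "the mitochondria is the powerhouse of the cell and ..."
benchmark_item = "Which organelle is the powerhouse of the cell?"

corpus_grams = ngrams(training_corpus, n=5)
item_grams = ngrams(benchmark_item, n=5)

overlap = len(item_grams & corpus_grams) / max(len(item_grams), 1)
print(f"n-gram overlap: {overlap:.0%}  (high overlap -> possible contamination)")
```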

Understanding vs. Memorization

GPT-4 solves Caesar cipher with common shifts (3, 5) but fails with uncommon values (67) — pattern recognition, not algorithm mastery.
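
For reference, the general algorithm is shift-agnostic: a shift of 67 is just 67 mod 26 = 15, so a system that had mastered the algorithm rather than memorized common cases would handle it identically. A minimal sketch:

```python
# General Caesar decryption works for any shift value, common or not:
# a shift of 67 reduces to 67 % 26 == 15, an ordinary shift.
def caesar_decrypt(ciphertext: str, shift: int) -> str:
    out = []
    for ch in ciphertext:
        if ch.isalpha():
            base = ord("A") if ch.isupper() else ord("a")
            out.append(chr((ord(ch) - base - shift) % 26 + base))
        else:
            out.append(ch)
    return "".join(out)

# "hello world" encrypted with shift 67 (i.e., 15) is "wtaad ldgas".
print(caesar_decrypt("wtaad ldgas", shift=67))  # -> hello world
```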

Infinite Task Space

Possible situations are effectively infinite; even millions of training examples cover ~0% of the possibilities. Logical relationships (e.g., transitivity, inverse relations) don't emerge naturally from standard training.
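
A hedged sketch of how such gaps can be probed: construct held-out items whose answers are never stated but follow logically (here, by transitivity) from given facts, and check the model's answers against the implied relation. `model_answer` and the facts are hypothetical placeholders.

```python
# Sketch of a transitivity probe: the correct answer is never stated directly;
# it only follows logically from chaining the given facts.
# `model_answer` is a hypothetical stand-in for the system under test.

def model_answer(prompt: str) -> str:
    return "yes"  # placeholder

facts = [("Ada", "Bo"), ("Bo", "Cy"), ("Cy", "Dee")]  # "X is taller than Y"

def transitive_pairs(facts):
    # Derive (X, Z) pairs implied by chaining two "taller than" facts.
    taller = dict(facts)
    for x, y in facts:
        if y in taller:
            yield x, taller[y]

for x, z in transitive_pairs(facts):
    prompt = ("Facts: " + "; ".join(f"{a} is taller than {b}" for a, b in facts)
              + f". Is {x} taller than {z}? Answer yes or no.")
    print(prompt, "->", model_answer(prompt), "(expected: yes)")
```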

Performance vs. Capability

Benchmarks measure raw scores without systematically assessing whether systems possess the underlying capabilities; they cannot distinguish genuine understanding from spurious correlations.
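
One hedged way to narrow this gap is consistency checking: re-score the same items under meaning-preserving rewordings and see whether the score survives. The items and `model_answer` below are hypothetical placeholders, a sketch rather than an established protocol.

```python
# Consistency probe sketch: a capability claim is stronger if the score
# survives meaning-preserving rewordings of the same items.
# `model_answer` is a hypothetical stand-in for the system under test.

items = [
    ("What is the boiling point of water at sea level, in Celsius?", "100"),
    ("At sea level, water boils at how many degrees Celsius?", "100"),  # paraphrase
]

def model_answer(question: str) -> str:
    return "100"  # placeholder

answers = [model_answer(q) for q, _ in items]
consistent = len(set(answers)) == 1
correct = all(a == gold for a, (_, gold) in zip(answers, items))
print(f"consistent across paraphrases: {consistent}, correct: {correct}")
```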

Emerging Approaches

  • FrontierMath — entirely unpublished problems crafted by expert mathematicians. Initial models scored <2%; o3 reached 25.2% within months, illustrating how quickly even very hard benchmarks saturate.
  • Compute-aware benchmarking — 2024 ARC Prize: systems limited to ~$10K of compute achieved similar accuracy, showing better algorithms can compensate for less compute. Benchmarks must specify compute budgets (see the sketch after this list).
  • Humanity’s Last Exam (HLE) — deliberately very difficult questions gathered from experts across all fields
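
For the compute-aware point above, a minimal sketch (all numbers made up) of what compute-aware reporting looks like: a score is only interpretable next to the compute budget it was achieved under.

```python
# Compute-aware reporting sketch: a score is only comparable when paired
# with the compute budget it was achieved under (all numbers are made up).
results = [
    {"system": "A", "accuracy": 0.53, "compute_usd": 10_000},
    {"system": "B", "accuracy": 0.55, "compute_usd": 300_000},
]

for r in results:
    # Report accuracy alongside cost, plus a crude accuracy-per-$1K figure.
    eff = r["accuracy"] / (r["compute_usd"] / 1_000)
    print(f"{r['system']}: acc={r['accuracy']:.0%} @ ${r['compute_usd']:,} "
          f"(acc per $1K compute: {eff:.4f})")
```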

Beyond Benchmarking: Evaluations

The Atlas’s key distinction:

  • Benchmarks measure what’s easily quantifiable
  • Evaluations work backward from concrete threat models, asking “What could go wrong?” and then systematically testing for those failure modes

“An evaluation represents a complete safety assessment protocol incorporating benchmarks alongside broader analyses and elicitation methods.”

Organizations like METR evaluate scenarios (e.g., models exploiting security vulnerabilities to gain compute and evade detection) — going far beyond what benchmarks capture.

Connection to Wiki

This subchapter clarifies the benchmarks ⊂ evaluations relationship. Benchmarks are useful for capability tracking but insufficient for safety certification: