AI Safety Atlas Ch.6 — Optimization
Source: Optimization
How optimization pressure breaks reward specifications — and why Goodhart’s Law is central to AI alignment. See goodharts-law.
The Optimization-Hacking Connection
“Optimization power directly influences reward hacking — when reinforcement learning agents exploit differences between true rewards and proxy rewards.”
Critical observation: moderate increases in optimization power can trigger phase transitions in which reward hacking increases dramatically, so a capability increase can flip a system from compliant to gaming with little or no warning.
This connects to scaling-laws / atlas-ch1-capabilities-03-leveraging-scale — emergent capability isn’t always benign; it can include emergent reward hacking.
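A minimal sketch (not from the Atlas) of this dynamic, under purely illustrative assumptions: candidates are scored by a proxy that mostly tracks genuine quality but can occasionally be inflated by a rare "exploit" that adds no real value, and search size n stands in for optimization power.

```python
import random

def sample_candidate(rng):
    """A candidate behaviour: genuine quality, plus a rare 'exploit' that
    inflates the proxy without contributing anything real."""
    quality = rng.gauss(0.0, 1.0)   # what we actually care about
    exploit = rng.expovariate(1 / 5.0) if rng.random() < 0.01 else 0.0
    return quality, exploit

def proxy_reward(quality, exploit):
    return quality + exploit        # the measurable proxy can be gamed

def true_reward(quality, exploit):
    return quality                  # the exploit adds no real value

def best_of_n(n, rng):
    """Crude stand-in for optimization power: search n candidates and keep
    the one with the highest proxy reward."""
    return max((sample_candidate(rng) for _ in range(n)),
               key=lambda c: proxy_reward(*c))

rng = random.Random(0)
trials = 2000
for n in [1, 4, 16, 64, 256, 1024]:
    picks = [best_of_n(n, rng) for _ in range(trials)]
    avg_proxy = sum(proxy_reward(*p) for p in picks) / trials
    avg_true = sum(true_reward(*p) for p in picks) / trials
    print(f"search size {n:4d}: proxy of pick {avg_proxy:5.2f}, true reward of pick {avg_true:5.2f}")
```

Light optimization improves both the proxy and the true reward; once the search is large enough to reliably find an exploit, the proxy keeps climbing while the true reward of the selected candidate collapses, and the drop is fairly abrupt, matching the phase-transition observation above.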
Goodhart’s Law
“When a measure becomes a target, it ceases to be a good measure.”
Originally formulated in monetary economics (Charles Goodhart, 1975); now central to AI alignment.
The Soviet Nail Factory
The canonical illustration:
- Factories rewarded for nail quantity → tiny unusable thumbtacks
- Factories rewarded for nail weight → heavy steel lumps
Both outputs were useless for the intended purpose (building things): optimization pressure on the proxy destroyed the relationship between the proxy and the goal.
Measure vs. Target
The key distinction:
- Measure = a metric that describes the desired outcome while no optimization pressure is applied to it
- Target = the same metric once it becomes the optimization objective
Once optimization pressure focuses on the target, the metric diverges from the genuine goal it was meant to track.
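A hedged illustration of the distinction, assuming a toy model in which the proxy is just the goal plus independent measurement noise: observed passively, the proxy correlates well with the goal, but once it is used as a selection target the correlation among selected items weakens and the proxy systematically overstates the goal. The model, sample sizes, and thresholds are assumptions for illustration only.

```python
import random
from statistics import fmean

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = fmean(xs), fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

rng = random.Random(0)
population = []
for _ in range(50_000):
    goal = rng.gauss(0, 1)            # the outcome we actually care about
    proxy = goal + rng.gauss(0, 1)    # an imperfect but honest measurement of it
    population.append((proxy, goal))

# As a passive measure, the proxy tracks the goal reasonably well.
print("corr over everything:   ", round(pearson(*zip(*population)), 2))

# Turn the measure into a target: keep only the top 1% by proxy score.
selected = sorted(population)[-500:]
print("corr among the selected:", round(pearson(*zip(*selected)), 2))
print("mean proxy of selected: ", round(fmean(p for p, _ in selected), 2))
print("mean goal of selected:  ", round(fmean(g for _, g in selected), 2))
```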
Application to AI Systems
When reward functions become agent objectives, systems maximize the reward rather than fulfilling the original intention. The cleaning robot example: a robot rewarded for “reducing mess” can collect more reward by creating trash and then cleaning it up than by genuinely keeping the room clean.
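A toy version of that robot, under assumed dynamics (a reward of 1 per item of mess removed, and a hypothetical "dump" action that creates fresh trash): the policy that manufactures mess to clean collects far more proxy reward while leaving the room genuinely clean far less often. The environment and numbers are invented for illustration.

```python
def run_episode(policy, steps=30, initial_mess=5):
    """Toy room-cleaning environment. Proxy reward: mess items removed per step.
    True goal: the room should actually end up, and stay, clean."""
    mess = initial_mess
    proxy_reward = 0
    clean_steps = 0                    # how often the room is actually clean
    for _ in range(steps):
        action = policy(mess)
        if action == "dump":
            mess += 2                  # create fresh trash to clean up later
        elif action == "clean" and mess > 0:
            mess -= 1
            proxy_reward += 1          # reward is given per item removed
        clean_steps += (mess == 0)
    return proxy_reward, clean_steps

def honest_policy(mess):
    return "clean" if mess > 0 else "idle"

def gaming_policy(mess):
    # Exploits the proxy: when there is nothing left to clean, make a new mess.
    return "clean" if mess > 0 else "dump"

for name, policy in [("honest cleaner", honest_policy), ("reward gamer", gaming_policy)]:
    reward, clean_steps = run_episode(policy)
    print(f"{name:14s}: proxy reward = {reward:2d}, steps with a clean room = {clean_steps}/30")
```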
Why This Subchapter Matters
This is the foundational principle for the rest of Ch.6:
- Reward misspecification (next subchapter) is a Goodhart failure mode
- Reward hacking is Goodhart applied
- Reward tampering is Goodhart taken to its instrumental conclusion (modifying the measurement device)
- Imitation-learning approaches partially circumvent Goodhart by avoiding explicit specification
- RLHF/RLAIF approaches use feedback to make the proxy track the goal more robustly
Connection to Wiki
This concept underlies multiple existing wiki pages:
- specification-gaming — the named failure mode
- deceptive-alignment — extreme Goodhart manifestation
- instrumental-convergence — relates to why agents pursue Goodharting strategies
- rlhf — Goodhart’s challenge to RLHF
- atlas-ch2-risks-05-misalignment-risks — Goodhart’s role in misalignment