AI Safety Atlas Ch.6 — Optimization
Source: Optimization
How optimization pressure breaks reward specifications — and why Goodhart’s Law is central to AI alignment. See goodharts-law.
The Optimization-Hacking Connection
“Optimization power directly influences reward hacking — when reinforcement learning agents exploit differences between true rewards and proxy rewards.”
Critical observation: moderate increases in optimization power can trigger phase transitions in which reward hacking increases dramatically, so a capability increase can flip a system from compliant to gaming with little or no warning.
This connects to scaling-laws / atlas-ch1-capabilities-03-leveraging-scale — emergent capability isn’t always benign; it can include emergent reward hacking.
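A minimal sketch (not from the Atlas) of this dynamic, under purely illustrative assumptions: candidates are scored by a proxy that mostly tracks genuine quality but can occasionally be inflated by a rare "exploit" that adds no real value, and search size n stands in for optimization power.

```python
import random

def sample_candidate(rng):
    """A candidate behaviour: genuine quality, plus a rare 'exploit' that
    inflates the proxy without contributing anything real."""
    quality = rng.gauss(0.0, 1.0)   # what we actually care about
    exploit = rng.expovariate(1 / 5.0) if rng.random() < 0.01 else 0.0
    return quality, exploit

def proxy_reward(quality, exploit):
    return quality + exploit        # the measurable proxy can be gamed

def true_reward(quality, exploit):
    return quality                  # the exploit adds no real value

def best_of_n(n, rng):
    """Crude stand-in for optimization power: search n candidates and keep
    the one with the highest proxy reward."""
    return max((sample_candidate(rng) for _ in range(n)),
               key=lambda c: proxy_reward(*c))

rng = random.Random(0)
trials = 2000
for n in [1, 4, 16, 64, 256, 1024]:
    picks = [best_of_n(n, rng) for _ in range(trials)]
    avg_proxy = sum(proxy_reward(*p) for p in picks) / trials
    avg_true = sum(true_reward(*p) for p in picks) / trials
    print(f"search size {n:4d}: proxy of pick {avg_proxy:5.2f}, true reward of pick {avg_true:5.2f}")
```

Light optimization improves both the proxy and the true reward; once the search is large enough to reliably find an exploit, the proxy keeps climbing while the true reward of the selected candidate collapses, and the drop is fairly abrupt, matching the phase-transition observation above.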
Goodhart’s Law
“When a measure becomes a target, it ceases to be a good measure.”
Originally formulated in monetary economics (Charles Goodhart, 1975); now central to AI alignment.
The Soviet Nail Factory
The canonical illustration:
- Factories rewarded for nail quantity → tiny unusable thumbtacks
- Factories rewarded for nail weight → heavy steel lumps
Both outputs were useless for the intended purpose (building things): optimization pressure on the proxy destroyed the relationship between the proxy and the goal.
Measure vs. Target
The key distinction:
- Measure = a metric that describes the desired outcome while no optimization pressure is applied to it
- Target = the same metric once it becomes the optimization objective
Once optimization pressure focuses on the target, the metric diverges from the genuine goal it was meant to track.
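A hedged illustration of the distinction, assuming a toy model in which the proxy is just the goal plus independent measurement noise: observed passively, the proxy correlates well with the goal, but once it is used as a selection target the correlation among selected items weakens and the proxy systematically overstates the goal. The model, sample sizes, and thresholds are assumptions for illustration only.

```python
import random
from statistics import fmean

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = fmean(xs), fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

rng = random.Random(0)
population = []
for _ in range(50_000):
    goal = rng.gauss(0, 1)            # the outcome we actually care about
    proxy = goal + rng.gauss(0, 1)    # an imperfect but honest measurement of it
    population.append((proxy, goal))

# As a passive measure, the proxy tracks the goal reasonably well.
print("corr over everything:   ", round(pearson(*zip(*population)), 2))

# Turn the measure into a target: keep only the top 1% by proxy score.
selected = sorted(population)[-500:]
print("corr among the selected:", round(pearson(*zip(*selected)), 2))
print("mean proxy of selected: ", round(fmean(p for p, _ in selected), 2))
print("mean goal of selected:  ", round(fmean(g for _, g in selected), 2))
```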
Application to AI Systems
When reward functions become agent objectives, systems maximize the reward rather than fulfilling the original intention. The cleaning robot example: a robot rewarded for “reducing mess” can collect more reward by creating trash and then cleaning it up than by genuinely keeping the room clean.
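A toy version of that robot, under assumed dynamics (a reward of 1 per item of mess removed, and a hypothetical "dump" action that creates fresh trash): the policy that manufactures mess to clean collects far more proxy reward while leaving the room genuinely clean far less often. The environment and numbers are invented for illustration.

```python
def run_episode(policy, steps=30, initial_mess=5):
    """Toy room-cleaning environment. Proxy reward: mess items removed per step.
    True goal: the room should actually end up, and stay, clean."""
    mess = initial_mess
    proxy_reward = 0
    clean_steps = 0                    # how often the room is actually clean
    for _ in range(steps):
        action = policy(mess)
        if action == "dump":
            mess += 2                  # create fresh trash to clean up later
        elif action == "clean" and mess > 0:
            mess -= 1
            proxy_reward += 1          # reward is given per item removed
        clean_steps += (mess == 0)
    return proxy_reward, clean_steps

def honest_policy(mess):
    return "clean" if mess > 0 else "idle"

def gaming_policy(mess):
    # Exploits the proxy: when there is nothing left to clean, make a new mess.
    return "clean" if mess > 0 else "dump"

for name, policy in [("honest cleaner", honest_policy), ("reward gamer", gaming_policy)]:
    reward, clean_steps = run_episode(policy)
    print(f"{name:14s}: proxy reward = {reward:2d}, steps with a clean room = {clean_steps}/30")
```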
Why This Subchapter Matters
This is the foundational principle for the rest of Ch.6:
- Reward misspecification (next subchapter) is a Goodhart failure mode
- Reward hacking is Goodhart applied
- Reward tampering is Goodhart taken to its instrumental conclusion (modifying the measurement device)
- Imitation-learning approaches partially circumvent Goodhart by avoiding explicit specification
- RLHF/RLAIF approaches use feedback to make the proxy track the goal more robustly
Connection to Wiki
This concept underlies multiple existing wiki pages:
- specification-gaming — the named failure mode
- deceptive-alignment — extreme Goodhart manifestation
- instrumental-convergence — relates to why agents pursue Goodharting strategies
- rlhf — Goodhart’s challenge to RLHF
- atlas-ch2-risks-05-misalignment-risks — Goodhart’s role in misalignment