AI Safety Atlas Ch.7 — Introduction
Source: Goal Misgeneralization — Introduction
The chapter’s framing: goal misgeneralization is “perhaps the most counterintuitive problem in AI safety.” It differs from specification failures (Ch.6) and capability gaps — systems learn unintended objectives despite receiving correct training signals.
Why It’s Counterintuitive
Goal misgeneralization sits in a structural gap:
- Specification problems (Ch.6) — wrong training signal
- Capability problems — system can’t perform the task
- Goal misgeneralization (Ch.7) — correct signal, capable system, but wrong learned goal
“Systems internalize different behaviors than intended despite receiving correct training signals.” The training signal can be perfect and the system still ends up pursuing the wrong objective.
The Critical Insight
Capabilities and goals generalize independently. Multiple distinct objectives can produce identical training behavior, remaining indistinguishable until deployment reveals the system’s actual learned goals.
This is the inner alignment problem complementing Ch.6’s outer alignment — see outer-vs-inner-alignment.
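The critical insight above can be made concrete with a toy sketch (my own illustration, in the spirit of the CoinRun example from the goal-misgeneralization literature; every name and number here is hypothetical, not from the Atlas). Two candidate goals agree on every training episode, so behavior alone cannot distinguish them; only a distribution shift at deployment reveals which goal was actually learned:

```python
# Toy 1-D gridworld. During training the coin always spawns at the right
# edge, so "reach the coin" and "reach the rightmost cell" reward the
# exact same behavior -- the training signal cannot tell them apart.
GRID_SIZE = 5

def reach_coin(final_pos, coin_pos):
    """Intended goal: end the episode on the coin."""
    return final_pos == coin_pos

def reach_right_edge(final_pos, coin_pos):
    """Unintended proxy goal: end the episode at the rightmost cell."""
    return final_pos == GRID_SIZE - 1

def run_policy_move_right(coin_pos):
    """A fully capable policy that simply walks to the right edge."""
    return GRID_SIZE - 1

# Training distribution: coin always at the rightmost cell.
train_coins = [GRID_SIZE - 1] * 100
assert all(
    reach_coin(run_policy_move_right(c), c) == reach_right_edge(run_policy_move_right(c), c)
    for c in train_coins
)  # identical training behavior under both goals

# Deployment shift: coin spawns at the left edge; the goals now diverge.
deploy_coin = 0
pos = run_policy_move_right(deploy_coin)
print(reach_coin(pos, deploy_coin))        # False: intended goal fails
print(reach_right_edge(pos, deploy_coin))  # True: learned proxy still satisfied
```

The policy is perfectly capable and the training signal was never wrong; the failure appears only because the proxy goal and the intended goal happened to coincide on the training distribution.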
The Four Sections
Learning Dynamics
Training as a search over algorithms. Loss landscapes, path dependence, inductive bias. “Training neural networks creates selection pressures rather than directly installing objectives.” See learning-dynamics.
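The "selection pressures rather than directly installed objectives" point can be sketched with a minimal function-fitting example (my own illustration, not the Atlas's): the training signal is perfect, yet two distinct algorithms both achieve zero training loss, so the signal alone cannot select between them; inductive bias and path dependence decide which one training actually finds.

```python
# Perfect training signal: y = x on the observed points.
train_data = [(0, 0), (1, 1)]

identity = lambda x: x      # intended algorithm
square = lambda x: x * x    # distinct algorithm, identical on the training set

def training_loss(f):
    """Squared-error loss over the training set."""
    return sum((f(x) - y) ** 2 for x, y in train_data)

# Both hypotheses are zero-loss: training pressure cannot distinguish them.
print(training_loss(identity), training_loss(square))  # 0 0

# Off the training distribution, they diverge.
print(identity(3), square(3))  # 3 9
```

Which zero-loss algorithm a real training run converges to is determined by the loss landscape, the optimization path, and the architecture's inductive bias, not by the (already perfect) signal.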
Goal-Directedness
Systems exhibit varying degrees of goal-directedness: pattern matching → genuine internal optimization. Higher goal-directedness makes misgeneralization increasingly dangerous. See goal-directedness and mesa-optimization.
Scheming
Goal-directed systems that are situationally aware of their training process and capable of long-term planning may strategically conceal misaligned objectives until the threat of modification disappears. See scheming.
Mitigation
“A defense-in-depth approach where multiple detection and mitigation measures are layered to provide robust defenses.” See goal-misgeneralization.
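The arithmetic behind layering can be sketched briefly (my own illustration with made-up numbers; the Atlas does not specify miss rates, and the calculation assumes the layers fail independently, which real detection measures may not): if each layer independently misses a misaligned model with some probability, stacking layers drives the combined miss rate down multiplicatively.

```python
# Hypothetical per-layer probabilities of missing a misaligned model.
miss_rates = [0.2, 0.3, 0.25]

# Assuming independent failures, the combined miss rate is the product.
combined = 1.0
for m in miss_rates:
    combined *= m

print(round(combined, 6))  # 0.015
```

Three individually mediocre layers yield a far stronger combined defense, which is the motivation for layering detection and mitigation measures rather than relying on any single one.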
Connection to Wiki
This chapter completes the alignment-failure-mode picture started in Ch.6:
- Ch.6 (outer alignment) — specification-gaming, goodharts-law
- Ch.7 (inner alignment) — goal-misgeneralization, mesa-optimization, scheming
- outer-vs-inner-alignment — the disambiguation
- atlas-ch6-specification-gaming-04-learning-from-feedback — where the Atlas explicitly distinguishes the two
Ch.8 (Scalable Oversight) addresses both via oversight mechanisms.