AI Safety Atlas Ch.7 — Introduction
Source: Goal Misgeneralization — Introduction
The chapter’s framing: goal misgeneralization is “perhaps the most counterintuitive problem in AI safety.” It differs from specification failures (Ch.6) and capability gaps — systems learn unintended objectives despite receiving correct training signals.
Why It’s Counterintuitive
Goal misgeneralization sits in a structural gap:
- Specification problems (Ch.6) — wrong training signal
- Capability problems — system can’t perform the task
- Goal misgeneralization (Ch.7) — correct signal, capable system, but wrong learned goal
“Systems internalize different behaviors than intended despite receiving correct training signals.” The training signal can be perfect and the system still ends up pursuing the wrong objective.
The Critical Insight
Capabilities and goals generalize independently. Multiple distinct objectives can produce identical training behavior, remaining indistinguishable until deployment reveals the system’s actual learned goals.
This is the inner alignment problem complementing Ch.6’s outer alignment — see outer-vs-inner-alignment.
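The critical insight above can be made concrete with a toy sketch (my own illustration, in the spirit of the CoinRun example from the goal-misgeneralization literature; every name and number here is hypothetical, not from the Atlas). Two candidate goals agree on every training episode, so behavior alone cannot distinguish them; only a distribution shift at deployment reveals which goal was actually learned:

```python
# Toy 1-D gridworld. During training the coin always spawns at the right
# edge, so "reach the coin" and "reach the rightmost cell" reward the
# exact same behavior -- the training signal cannot tell them apart.
GRID_SIZE = 5

def reach_coin(final_pos, coin_pos):
    """Intended goal: end the episode on the coin."""
    return final_pos == coin_pos

def reach_right_edge(final_pos, coin_pos):
    """Unintended proxy goal: end the episode at the rightmost cell."""
    return final_pos == GRID_SIZE - 1

def run_policy_move_right(coin_pos):
    """A fully capable policy that simply walks to the right edge."""
    return GRID_SIZE - 1

# Training distribution: coin always at the rightmost cell.
train_coins = [GRID_SIZE - 1] * 100
assert all(
    reach_coin(run_policy_move_right(c), c) == reach_right_edge(run_policy_move_right(c), c)
    for c in train_coins
)  # identical training behavior under both goals

# Deployment shift: coin spawns at the left edge; the goals now diverge.
deploy_coin = 0
pos = run_policy_move_right(deploy_coin)
print(reach_coin(pos, deploy_coin))        # False: intended goal fails
print(reach_right_edge(pos, deploy_coin))  # True: learned proxy still satisfied
```

The policy is perfectly capable and the training signal was never wrong; the failure appears only because the proxy goal and the intended goal happened to coincide on the training distribution.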
The Four Sections
Learning Dynamics
Training as a search over algorithms. Loss landscapes, path dependence, inductive bias. “Training neural networks creates selection pressures rather than directly installing objectives.” See learning-dynamics.
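The "selection pressures rather than directly installed objectives" point can be sketched with a minimal function-fitting example (my own illustration, not the Atlas's): the training signal is perfect, yet two distinct algorithms both achieve zero training loss, so the signal alone cannot select between them; inductive bias and path dependence decide which one training actually finds.

```python
# Perfect training signal: y = x on the observed points.
train_data = [(0, 0), (1, 1)]

identity = lambda x: x      # intended algorithm
square = lambda x: x * x    # distinct algorithm, identical on the training set

def training_loss(f):
    """Squared-error loss over the training set."""
    return sum((f(x) - y) ** 2 for x, y in train_data)

# Both hypotheses are zero-loss: training pressure cannot distinguish them.
print(training_loss(identity), training_loss(square))  # 0 0

# Off the training distribution, they diverge.
print(identity(3), square(3))  # 3 9
```

Which zero-loss algorithm a real training run converges to is determined by the loss landscape, the optimization path, and the architecture's inductive bias, not by the (already perfect) signal.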
Goal-Directedness
Systems exhibit varying degrees of goal-directedness: pattern matching → genuine internal optimization. Higher goal-directedness makes misgeneralization increasingly dangerous. See goal-directedness and mesa-optimization.
Scheming
Goal-directed systems that are situationally aware of their training process and capable of long-term planning may strategically conceal misaligned objectives until the threat of modification disappears. See scheming.
Mitigation
“A defense-in-depth approach where multiple detection and mitigation measures are layered to provide robust defenses.” See goal-misgeneralization.
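The arithmetic behind layering can be sketched briefly (my own illustration with made-up numbers; the Atlas does not specify miss rates, and the calculation assumes the layers fail independently, which real detection measures may not): if each layer independently misses a misaligned model with some probability, stacking layers drives the combined miss rate down multiplicatively.

```python
# Hypothetical per-layer probabilities of missing a misaligned model.
miss_rates = [0.2, 0.3, 0.25]

# Assuming independent failures, the combined miss rate is the product.
combined = 1.0
for m in miss_rates:
    combined *= m

print(round(combined, 6))  # 0.015
```

Three individually mediocre layers yield a far stronger combined defense, which is the motivation for layering detection and mitigation measures rather than relying on any single one.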
Connection to Wiki
This chapter completes the alignment-failure-mode picture started in Ch.6:
- Ch.6 (outer alignment) — specification-gaming, goodharts-law
- Ch.7 (inner alignment) — goal-misgeneralization, mesa-optimization, scheming
- outer-vs-inner-alignment — the disambiguation
- atlas-ch6-specification-gaming-04-learning-from-feedback — where the Atlas explicitly distinguishes the two
Ch.8 (Scalable Oversight) addresses both via oversight mechanisms.