AI Safety Atlas Ch.6 — Introduction
Source: Specification Gaming — Introduction | Authors: Markov Grey & Charbel-Raphaël Ségerie | 1 min
The chapter examines how AI systems optimize reward signals in unintended ways — the canonical AI alignment failure mode. Five sections:
Section Overview
- Reinforcement Learning — foundational RL concepts: rewards, policies, value functions
- Optimization — Goodhart’s Law, why specifications become fragile under optimization pressure (a toy sketch follows this list)
- Reward Misspecification — the core outer alignment problem: reward hacking, reward tampering
- Learning by Imitation — solutions based on mimicking human behavior (BC, IRL, CIRL)
- Learning by Feedback — RLHF, RLAIF/Constitutional AI, DPO
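To make the Goodhart's Law item above concrete before the full treatment, here is a minimal sketch with toy numbers (an illustrative assumption, not code from the chapter) of the regressional form of Goodharting: a measurable proxy score equals the true quality plus noise, and selecting harder on the proxy tends to widen the gap between what is measured and what is actually wanted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Each candidate "policy" has a true quality we care about and a proxy score
# we can actually measure; the proxy is the true quality plus measurement noise.
n = 100_000
true_quality = rng.normal(0.0, 1.0, size=n)
proxy_score = true_quality + rng.normal(0.0, 1.0, size=n)

# More optimization pressure = picking the best proxy score from a larger pool.
# On average, the selected proxy keeps climbing while the true quality lags,
# so the proxy-vs-true gap grows with the search budget.
for budget in (10, 1_000, 100_000):
    best = int(np.argmax(proxy_score[:budget]))
    gap = proxy_score[best] - true_quality[best]
    print(f"budget={budget:>7}  proxy={proxy_score[best]:5.2f}  "
          f"true={true_quality[best]:5.2f}  gap={gap:5.2f}")
```

The measured number looks better and better under heavier selection while the underlying quality improves far more slowly; the same dynamic is what makes hand-written reward specifications fragile under strong optimization.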
What This Chapter Adds
The chapter deepens the wiki’s existing alignment-related pages by:
- Operationalizing outer alignment (the specification problem) vs. inner alignment (next chapter, Goal Misgeneralization)
- Introducing Goodhart’s Law as the foundational principle behind specification gaming
- Cataloging concrete failure modes: reward hacking, reward tampering, wireheading
- Comparing imitation-based and feedback-based alignment approaches (see the loss-function sketch after this list)
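As a pointer to the comparison in the last bullet, here is a minimal sketch (hypothetical toy functions, not the chapter's code) of the two supervision signals involved: behavioral cloning fits a policy directly to demonstrated actions, while feedback-based methods such as RLHF fit a reward model to pairwise preferences with a Bradley–Terry style loss.

```python
import numpy as np

def bc_loss(policy_logits, demo_action):
    """Behavioral cloning: negative log-likelihood of the demonstrated action."""
    log_probs = policy_logits - np.log(np.sum(np.exp(policy_logits)))
    return -log_probs[demo_action]

def preference_loss(reward_chosen, reward_rejected):
    """Bradley-Terry preference loss: push the reward of the preferred outcome
    above the rejected one; this is how reward models learn from comparisons."""
    return -np.log(1.0 / (1.0 + np.exp(-(reward_chosen - reward_rejected))))

# Imitation example: 3 possible actions, the demonstrator picked action 2.
print(bc_loss(np.array([0.1, 0.2, 1.5]), demo_action=2))

# Feedback example: the reward model scores the preferred response higher.
print(preference_loss(reward_chosen=2.0, reward_rejected=0.5))
```

Imitation needs demonstrations of the desired behavior itself; preference-based feedback only needs a human to say which of two outputs is better, which is why it scales to tasks people can judge but not perform.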
This is the technical deep dive into what Ch.2 introduced as misalignment risks (atlas-ch2-risks-05-misalignment-risks): Ch.6 covers the outer-alignment half, Ch.7 covers the inner-alignment half (goal-misgeneralization).
Connection to Wiki
- specification-gaming — central concept this chapter develops
- goodharts-law — foundational principle
- rlhf — substantially deepened
- outer-vs-inner-alignment — disambiguation
- ai-alignment — broader context