AI Safety Atlas Ch.6 — Introduction

Source: Specification Gaming — Introduction | Authors: Markov Grey & Charbel-Raphaël Ségerie | 1 min

The chapter examines how AI systems optimize reward signals in unintended ways, the canonical AI alignment failure mode known as specification gaming. After this introduction, five sections follow:

Section Overview

  • Reinforcement Learning — foundational RL concepts: rewards, policies, value functions
  • Optimization — Goodhart’s Law and why specifications become fragile under optimization pressure (see the sketch after this list)
  • Reward Misspecification — the core outer alignment problem: reward hacking, reward tampering
  • Learning by Imitation — solutions based on mimicking human behavior: behavioral cloning (BC), inverse reinforcement learning (IRL), cooperative inverse reinforcement learning (CIRL)
  • Learning by Feedback — reinforcement learning from human feedback (RLHF), RLAIF/Constitutional AI, direct preference optimization (DPO)
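
The following minimal sketch (hypothetical, not taken from the chapter) illustrates the Goodhart's Law dynamic the Optimization section covers: a proxy reward that tracks the true objective for ordinary behavior comes apart from it once an optimizer searches hard enough. All function names and numbers are illustrative assumptions.

```python
import numpy as np

# Hypothetical illustration of Goodhart's Law: a proxy reward that agrees
# with the true objective for typical behavior, but diverges once an
# optimizer pushes the proxy toward its extreme.

rng = np.random.default_rng(0)

def true_objective(x):
    # What we actually care about: peaks at x = 1, falls off for larger x.
    return x - 0.5 * x**2

def proxy_reward(x):
    # The measurable specification handed to the optimizer: keeps growing
    # with x, so it matches the true objective only for small x.
    return x + 0.05 * rng.normal()

# Weak optimization: a few candidate behaviors over a narrow range.
candidates_weak = rng.uniform(0, 2, size=5)
# Strong optimization: far more aggressive search over a wider range.
candidates_strong = rng.uniform(0, 10, size=10_000)

for label, candidates in [("weak", candidates_weak), ("strong", candidates_strong)]:
    best = max(candidates, key=proxy_reward)
    print(f"{label:>6} optimization -> proxy={proxy_reward(best):.2f}, "
          f"true objective={true_objective(best):.2f}")
```

Under these assumptions, the strongly optimized candidate scores much higher on the proxy while scoring far worse on the true objective: the specification is gamed rather than satisfied.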

What This Chapter Adds

The chapter deepens the wiki’s existing alignment-related pages by:

  • Operationalizing outer alignment (the specification problem) vs. inner alignment (next chapter, Goal Misgeneralization)
  • Introducing Goodhart’s Law as the foundational principle behind specification gaming
  • Cataloging concrete failure modes: reward hacking, reward tampering, wireheading (illustrated in the sketch after this list)
  • Comparing imitation-based and feedback-based alignment approaches
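
As a toy illustration of the reward hacking and reward tampering failure modes listed above, here is a hypothetical sketch (not from the chapter; the world model, actions, and reward channel are invented for illustration). The reward is computed from a sensor reading rather than from the world itself, so the proxy-optimal policy manipulates the sensor instead of doing the task.

```python
from dataclasses import dataclass

@dataclass
class World:
    mess: int = 10              # actual mess remaining (what we care about)
    sensor_blocked: bool = False

def observed_mess(world: World) -> int:
    # The reward channel only sees the sensor, not the world directly.
    return 0 if world.sensor_blocked else world.mess

ACTIONS = {
    "clean": lambda w: World(mess=max(w.mess - 1, 0), sensor_blocked=w.sensor_blocked),
    "block_sensor": lambda w: World(mess=w.mess, sensor_blocked=True),
}

def rollout(policy: str, steps: int = 10) -> tuple[int, int]:
    world, total_reward = World(), 0
    for _ in range(steps):
        world = ACTIONS[policy](world)
        total_reward += -observed_mess(world)   # reward: "no mess detected"
    return total_reward, world.mess

for policy in ("clean", "block_sensor"):
    reward, real_mess = rollout(policy)
    print(f"{policy:>12}: proxy reward={reward:4d}, actual mess left={real_mess}")
```

In this toy setup the sensor-blocking policy earns the highest proxy reward while leaving the real mess untouched: it optimizes the specification as written, not the goal the designer intended.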

This chapter is the technical deep dive on the misalignment risks introduced in Ch.2 (atlas-ch2-risks-05-misalignment-risks): Ch.6 covers the outer-alignment half, while Ch.7 (goal-misgeneralization) covers the inner-alignment half.

Connection to Wiki