AI Safety Atlas Ch.6 — Introduction

Source: Specification Gaming — Introduction | Authors: Markov Grey & Charbel-Raphaël Ségerie | 1 min

The chapter examines how AI systems optimize reward signals in unintended ways, the canonical AI alignment failure mode known as specification gaming. After this introduction, five sections follow:

Section Overview

  • Reinforcement Learning — foundational RL concepts: rewards, policies, value functions
  • Optimization — Goodhart’s Law and why specifications become fragile under optimization pressure (see the sketch after this list)
  • Reward Misspecification — the core outer alignment problem: reward hacking, reward tampering
  • Learning by Imitation — solutions based on mimicking human behavior: behavioral cloning (BC), inverse reinforcement learning (IRL), cooperative inverse reinforcement learning (CIRL)
  • Learning by Feedback — reinforcement learning from human feedback (RLHF), RLAIF/Constitutional AI, direct preference optimization (DPO)
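
The following minimal sketch (hypothetical, not taken from the chapter) illustrates the Goodhart's Law dynamic the Optimization section covers: a proxy reward that tracks the true objective for ordinary behavior comes apart from it once an optimizer searches hard enough. All function names and numbers are illustrative assumptions.

```python
import numpy as np

# Hypothetical illustration of Goodhart's Law: a proxy reward that agrees
# with the true objective for typical behavior, but diverges once an
# optimizer pushes the proxy toward its extreme.

rng = np.random.default_rng(0)

def true_objective(x):
    # What we actually care about: peaks at x = 1, falls off for larger x.
    return x - 0.5 * x**2

def proxy_reward(x):
    # The measurable specification handed to the optimizer: keeps growing
    # with x, so it matches the true objective only for small x.
    return x + 0.05 * rng.normal()

# Weak optimization: a few candidate behaviors over a narrow range.
candidates_weak = rng.uniform(0, 2, size=5)
# Strong optimization: far more aggressive search over a wider range.
candidates_strong = rng.uniform(0, 10, size=10_000)

for label, candidates in [("weak", candidates_weak), ("strong", candidates_strong)]:
    best = max(candidates, key=proxy_reward)
    print(f"{label:>6} optimization -> proxy={proxy_reward(best):.2f}, "
          f"true objective={true_objective(best):.2f}")
```

Under these assumptions, the strongly optimized candidate scores much higher on the proxy while scoring far worse on the true objective: the specification is gamed rather than satisfied.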

What This Chapter Adds

The chapter deepens the wiki’s existing alignment-related pages by:

  • Operationalizing outer alignment (the specification problem) vs. inner alignment (next chapter, Goal Misgeneralization)
  • Introducing Goodhart’s Law as the foundational principle behind specification gaming
  • Cataloging concrete failure modes: reward hacking, reward tampering, wireheading (illustrated in the sketch after this list)
  • Comparing imitation-based and feedback-based alignment approaches
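
As a toy illustration of the reward hacking and reward tampering failure modes listed above, here is a hypothetical sketch (not from the chapter; the world model, actions, and reward channel are invented for illustration). The reward is computed from a sensor reading rather than from the world itself, so the proxy-optimal policy manipulates the sensor instead of doing the task.

```python
from dataclasses import dataclass

@dataclass
class World:
    mess: int = 10              # actual mess remaining (what we care about)
    sensor_blocked: bool = False

def observed_mess(world: World) -> int:
    # The reward channel only sees the sensor, not the world directly.
    return 0 if world.sensor_blocked else world.mess

ACTIONS = {
    "clean": lambda w: World(mess=max(w.mess - 1, 0), sensor_blocked=w.sensor_blocked),
    "block_sensor": lambda w: World(mess=w.mess, sensor_blocked=True),
}

def rollout(policy: str, steps: int = 10) -> tuple[int, int]:
    world, total_reward = World(), 0
    for _ in range(steps):
        world = ACTIONS[policy](world)
        total_reward += -observed_mess(world)   # reward: "no mess detected"
    return total_reward, world.mess

for policy in ("clean", "block_sensor"):
    reward, real_mess = rollout(policy)
    print(f"{policy:>12}: proxy reward={reward:4d}, actual mess left={real_mess}")
```

In this toy setup the sensor-blocking policy earns the highest proxy reward while leaving the real mess untouched: it optimizes the specification as written, not the goal the designer intended.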

This chapter is the technical deep dive on the misalignment risks introduced in Ch.2 (atlas-ch2-risks-05-misalignment-risks): Ch.6 covers the outer-alignment half, while Ch.7 (goal-misgeneralization) covers the inner-alignment half.

Connection to Wiki