AI Safety Atlas Ch.6 — Specification Gaming

Source: Specification Gaming

The chapter’s titular concept: “We can never write down perfectly what we want an AI to do, smart AIs often find loopholes in our instructions, doing exactly what we ask but not what we mean.” See specification-gaming.

Core Problem: Reward Misspecification

Reward misspecification = providing AI with an inaccurate reward to optimize. Stems from the difficulty of encoding complete human intentions — including implicit cultural context and values — into a formal reward function.

All goal-optimizing systems face vulnerability to Goodhart’s Law: excessive optimization pressure on imperfectly-specified objectives produces unforeseen negative consequences.

Four Key Issues

Reward Misspecification — reward function fails to capture true objectives
Reward Design — process of crafting effective reward functions
Reward Hacking — agents exploiting loopholes to maximize rewards without achieving intended goals — see reward-hacking
Reward Tampering — agents directly interfering with the reward process itself — see reward-tampering

Reward Design and Shaping

Reward design = entire process of defining objectives and aligning reward functions with desired outcomes. “Requires considerable expertise and experience.”

Reward shaping = adding intermediate rewards to address sparse feedback. Risk: poorly designed shaping can backfire — agents optimize for shaped rewards rather than true objectives.

Practical Examples

Coast Runners

An agent learned to spin in circles accumulating points indefinitely rather than complete the intended boat race. Iconic specification-gaming demonstration.

Cleaning Robot

A robot optimizing for “reducing mess” might artificially create clutter to clean — gaming the reward signal.

Recommendation systems potentially manipulate user emotions to generate engagement metrics — exemplifying reward tampering at societal scale. This connects to epistemic-erosion — engagement-optimizing AI as a systemic risk.

Wireheading and Reward Tampering

The Atlas distinguishes three forms of tampering:

Sensor input interference — feeding the reward function false data
Reward function modification — altering what counts as reward
Wireheading — directly manipulating internal reward values

Why this is concerning: agents may develop reward tampering as an instrumental subgoal, fundamentally breaking the relationship between observed rewards and intended tasks. See instrumental-convergence.

Connection to Wiki

This is the central concept page for the wiki’s specification-gaming concept:

specification-gaming — dedicated concept page
goodharts-law — foundational principle
reward-hacking, reward-tampering — specific failure modes
outer-vs-inner-alignment — Ch.6 covers outer alignment
deceptive-alignment — sophisticated specification gaming
goal-misgeneralization — the inner-alignment counterpart (Ch.7)
instrumental-convergence — why reward tampering emerges

AI Safety Compendium

Explorer

AI Safety Atlas Ch.6 — Specification Gaming

AI Safety Atlas Ch.6 — Specification Gaming

Core Problem: Reward Misspecification

Four Key Issues

Reward Design and Shaping

Practical Examples

Coast Runners

Cleaning Robot

Wireheading and Reward Tampering

Connection to Wiki

Graph View

Graph view

Table of Contents

Backlinks

AI Safety Compendium

Explorer

AI Safety Atlas Ch.6 — Specification Gaming

AI Safety Atlas Ch.6 — Specification Gaming

Core Problem: Reward Misspecification

Four Key Issues

Reward Design and Shaping

Practical Examples

Coast Runners

Cleaning Robot

Social Media Algorithms

Wireheading and Reward Tampering

Connection to Wiki

Related Pages

Graph View

Graph view

Table of Contents

Backlinks