AI Safety Atlas Ch.6 — Learning from Imitation

Source: Learning from Imitation

Imitation learning addresses reward misspecification by training AI to replicate human behavior rather than manually specifying reward functions. Covers Behavioral Cloning, Procedural Cloning, Inverse RL, and Cooperative Inverse RL.

Imitation Learning (IL)

Trains AI systems to replicate expert actions through observation. Unlike RL, which derives a policy through environmental interaction, IL learns directly from observed expert behavior. Applied in modern LLMs during instruction-following fine-tuning.

Behavioral Cloning (BC)

Uses supervised learning on expert demonstrations to train models that mimic the expert. Self-driving car example: collect state-action pairs from human drivers, then train a model to predict the driver's action from the current state.
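
A minimal sketch of the BC training loop, assuming demonstrations have already been collected as fixed-size state and action tensors; the network shape, dimensions, and random data below are placeholders, not from the source.

```python
import torch
import torch.nn as nn

# Hypothetical demonstration data: states (e.g., flattened sensor readings)
# paired with the expert's continuous actions (e.g., steering, throttle).
states = torch.randn(10_000, 32)   # 10k demonstrations, 32-dim state
actions = torch.randn(10_000, 2)   # 2-dim expert action per state

policy = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()  # plain regression onto the expert's action

for epoch in range(10):
    pred = policy(states)          # model's action at each demo state
    loss = loss_fn(pred, actions)  # distance from the expert's action
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Note that no reward appears anywhere: the objective is pure prediction error against the expert, which is what sidesteps reward misspecification and also what caps the model at the expert's level.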

Two key problems:

  • Confident incorrectness — models imitate the surface form of expert behavior without the background knowledge humans possess. An LLM trained to imitate human conversation reproduces its demonstrators' confident tone while hallucinating facts.
  • Underachieving — a model trained to clone a demonstrator is capped at the demonstrator's performance and may suppress capabilities that exceed it, artificially limiting itself.

Procedural Cloning (PC)

Extends BC by learning the complete sequence of intermediate computations, not just the final action. This captures the expert's underlying reasoning and improves generalization to new environments. Trade-offs: computational overhead, plus the expert's algorithm must be carefully encoded into the training data.
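
A hedged sketch of the idea, assuming the expert's intermediate computation (e.g., a search or planning trace) can be serialized into tokens; the tiny GRU, vocabulary size, and random batch are placeholders.

```python
import torch
import torch.nn as nn

# Procedure cloning trains on (state -> step_1 -> ... -> step_N -> action)
# sequences instead of bare (state -> action) pairs, all encoded as tokens.
VOCAB, DIM = 128, 64

class TraceModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.rnn = nn.GRU(DIM, DIM, batch_first=True)
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, tokens):
        hidden, _ = self.rnn(self.embed(tokens))
        return self.head(hidden)  # next-token logits at every position

model = TraceModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# One fake batch: [state tokens | intermediate computation | action token].
seq = torch.randint(0, VOCAB, (8, 20))
logits = model(seq[:, :-1])
# Supervise every step of the procedure, not only the final action.
loss = nn.functional.cross_entropy(logits.reshape(-1, VOCAB),
                                   seq[:, 1:].reshape(-1))
opt.zero_grad(); loss.backward(); opt.step()
```

Supervising the whole trace is what buys the better generalization, and also where the stated trade-off lives: traces make training sequences much longer, and someone must encode the expert's procedure into them.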

Inverse Reinforcement Learning (IRL)

Infers reward functions by observing agent behavior rather than defining them explicitly. Useful when reward functions are too complex to specify programmatically.

Critical limitation: “IRL assumes observed behavior is optimal — an assumption problematic with human demonstrations.” Additionally, the problem is ill-posed: multiple reward functions fit the same observations.
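
A toy illustration of the ill-posedness, assuming a linear reward over hand-picked action features and a perfectly rational expert (all values invented):

```python
import numpy as np

# Toy setting: 3 available actions, each described by a 2-dim feature vector.
# The expert is always observed choosing action 0.
features = np.array([[1.0, 0.0],
                     [0.0, 1.0],
                     [0.5, 0.5]])

def optimal_action(w):
    return int(np.argmax(features @ w))  # rational agent maximizes w . phi(a)

# Very different reward weights all explain the demonstration equally well:
for w in [np.array([1.0, 0.0]), np.array([2.0, 0.1]), np.array([0.6, 0.2])]:
    assert optimal_action(w) == 0  # consistent with the observed behavior
```

All three weight vectors are "correct" IRL answers for the same data, yet each would generalize differently to new options.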

See inverse-reinforcement-learning.

Cooperative Inverse Reinforcement Learning (CIRL)

Addresses IRL's limitations through interactive learning. Rather than passively inferring a fixed reward from demonstrations, the agent treats human feedback as evidence about what the human wants, and the human actively nudges it toward aligned behavior (a minimal sketch of the belief update follows below).

Key advantages:

  • Accommodates varied teaching behaviors
  • Handles non-optimal human demonstrations
  • Enables systems to learn not just what to do, but how and why

CIRL is associated with Stuart Russell’s assistance-games framework — see the SR2025 assistance-games-assistive-agents agenda.
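
CIRL itself is a two-player game; the following is only a minimal sketch of its Bayesian core, assuming a Boltzmann-rational (noisily optimal) human choosing among a few options, with invented reward hypotheses and numbers.

```python
import numpy as np

# Two candidate reward hypotheses over 3 options; the robot starts uncertain.
rewards = np.array([[1.0, 0.0, 0.5],   # hypothesis A
                    [0.0, 1.0, 0.5]])  # hypothesis B
belief = np.array([0.5, 0.5])
BETA = 2.0  # rationality: the human usually, but not always, picks the best

def p_choice(reward_row):
    """Boltzmann-rational human: P(option) proportional to exp(BETA * reward)."""
    e = np.exp(BETA * reward_row)
    return e / e.sum()

# The human picks option 1 twice; the robot updates its belief over rewards.
for choice in [1, 1]:
    likelihood = np.array([p_choice(r)[choice] for r in rewards])
    belief = belief * likelihood
    belief /= belief.sum()

print(belief)  # mass shifts toward hypothesis B without assuming optimality
```

Because the human model allows noise, an occasional suboptimal choice merely shifts the belief rather than breaking the inference, which is the property the bullets above point to.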

The Goal Inference Problem

The fundamental challenge: determining what an agent actually wants by observing its behavior. This is complicated because humans often act suboptimally or fail to achieve their stated goals.

The Easy Goal Inference Problem

A simplified version: “finding a reasonable representation or approximation of what a human wants, given complete access to the human’s policy or behavior in any situation.”

Even this constrained version remains unsolved for complex domains.
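
A tiny illustration of one reason why, assuming a Boltzmann-rational human (numbers invented): even with the human's complete policy in hand, the reward representation behind it remains underdetermined.

```python
import numpy as np

# "Easy" setting: we get the human's full policy, i.e. choice probabilities
# in every situation, not just sampled behavior.
def boltzmann_policy(rewards, beta=1.0):
    e = np.exp(beta * rewards)
    return e / e.sum(axis=-1, keepdims=True)

true_rewards = np.array([[1.0, 0.0],    # situation 0
                         [0.0, 1.0]])   # situation 1
full_policy = boltzmann_policy(true_rewards)

# Rescaling the rewards while rescaling rationality yields the SAME policy,
# so even complete policy access cannot pin down what the human wants.
rescaled = boltzmann_policy(3.0 * true_rewards, beta=1.0 / 3.0)
assert np.allclose(full_policy, rescaled)
```

Here only the reward scale is lost, the mildest case; with an unknown rationality parameter the ambiguity gets far worse, as the next section's sketch shows.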

The Critical Limitation

When we want AI systems to exceed human expertise, perfect imitation accuracy becomes the wrong target:

  • A system accurately imitating human errors perpetuates flawed decision-making
  • Distinguishing good decisions from mistakes requires modeling human irrationality
  • Constructing error models is as complex as modeling complete behavior

The deeper challenge: “Without accuracy as a measure, how do we evaluate model quality?” If the goal is alignment beyond imitation, we need some signal about what to optimize toward, and an explicit signal of that kind is exactly what specification gaming makes hard to provide (the sketch below illustrates the underdetermination).
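
A toy illustration of the preference-versus-mistake confound, again with a Boltzmann human model and invented numbers: the same observed choice frequencies are explained either by a strong preference executed sloppily or a mild preference executed carefully.

```python
import numpy as np

def p_choice(reward_row, beta):
    e = np.exp(beta * np.asarray(reward_row))
    return e / e.sum()

# Observed: the human picks option 0 about 60% of the time.
# Two very different explanations fit those frequencies almost exactly:
careless = p_choice([1.0, 0.0], beta=0.405)  # large reward gap, low rationality
precise  = p_choice([0.2, 0.0], beta=2.027)  # small reward gap, high rationality

print(careless)  # ~[0.60, 0.40]
print(precise)   # ~[0.60, 0.40]
```

Without an independent handle on the error model (beta here), behavior alone cannot separate the two hypotheses, so accuracy against behavior stops being a usable quality measure.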

Why Imitation Learning Doesn’t Solve Alignment

Imitation-based approaches partially circumvent Goodhart’s Law (no explicit reward to game) but introduce different problems:

  • Bounded by demonstrator capability (“underachieving”)
  • Bounded by demonstrator alignment (can replicate human bias/error)
  • Can’t ensure robust generalization
  • Doesn’t scale to ASI (humans can’t demonstrate superhuman behavior)

This is why the textbook treats imitation learning as part of the alignment toolkit, not the solution. Combined with feedback-based methods (RLHF, RLAIF, DPO; Ch. 6.4) and oversight techniques (Ch. 8), imitation methods form the current applied-alignment stack.

Connection to Wiki