Inverse Reinforcement Learning (IRL) and CIRL

Inverse Reinforcement Learning (IRL) is a family of approaches that infers reward functions from observed agent behavior rather than defining them explicitly. It is useful for AI alignment because it sidesteps the specification problem: instead of writing down what we want, we infer it from human behavior.

The AI Safety Atlas (Ch. 6.5) treats IRL and its cooperative variant CIRL as imitation-learning approaches to addressing reward misspecification.

Standard IRL

Inverse RL flips the standard RL formulation:

  • Forward RL: given reward function → learn behavior
  • Inverse RL: given behavior → infer reward function

This is useful when reward functions are too complex to specify programmatically but expert demonstrations are available; a minimal sketch of the inference step follows.
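
To make the inverse direction concrete, here is a minimal sketch in the spirit of maximum-entropy IRL (Ziebart et al., 2008): on a toy 5-state chain, a per-state reward is adjusted by gradient ascent until the state-visitation frequencies of the induced soft-optimal policy match the expert's. The environment, horizon, and learning rate are illustrative choices, not anything prescribed by the Atlas.

    import numpy as np

    n_states, n_actions, horizon = 5, 2, 8

    def step(s, a):                          # deterministic chain: 0 = left, 1 = right
        return min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)

    def soft_policy_visits(r):
        # Soft (maximum-entropy) value iteration under per-state reward r.
        v = np.zeros(n_states)
        for _ in range(horizon):
            q = np.array([[r[s] + v[step(s, a)] for a in range(n_actions)]
                          for s in range(n_states)])
            v = np.log(np.exp(q).sum(axis=1))    # soft max over actions
        pi = np.exp(q - v[:, None])              # soft-optimal stochastic policy
        d = np.zeros(n_states); d[0] = 1.0       # start distribution: state 0
        visits = d.copy()
        for _ in range(horizon - 1):             # expected state-visitation counts
            nxt = np.zeros(n_states)
            for s in range(n_states):
                for a in range(n_actions):
                    nxt[step(s, a)] += d[s] * pi[s, a]
            d = nxt
            visits += d
        return visits

    # "Expert" visitation counts, simulated here from a reward peaked at the goal.
    expert_visits = soft_policy_visits(np.array([0., 0., 0., 0., 10.]))

    r = np.zeros(n_states)                       # learned per-state reward
    for _ in range(200):
        # MaxEnt IRL gradient: expert visitation minus current model visitation.
        r += 0.1 * (expert_visits - soft_policy_visits(r))
    print(np.round(r, 2))                        # reward concentrates on the goal state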

Two Critical Limitations

1. Optimality Assumption

“IRL assumes observed behavior is optimal — an assumption problematic with human demonstrations.” Humans act suboptimally, behave irrationally, and make mistakes, so inferring “what they want” from “what they do” requires modeling human irrationality, which is itself hard.
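
The standard relaxation is a noisy-rationality (“Boltzmann”) demonstrator model, in which action probabilities scale with exponentiated action values. A minimal sketch with made-up Q-values: a single parameter beta interpolates between a uniformly random human (beta = 0) and the perfectly optimal one the naive assumption demands (beta → ∞).

    import numpy as np

    def boltzmann_policy(q_values, beta):
        # P(a | s) proportional to exp(beta * Q(s, a)); beta is the
        # "rationality" knob the IRL model must choose or infer.
        logits = beta * np.asarray(q_values, dtype=float)
        logits -= logits.max()               # for numerical stability
        p = np.exp(logits)
        return p / p.sum()

    q = [1.0, 0.8, -0.5]                     # illustrative Q-values for one state
    for beta in (0.0, 1.0, 10.0):            # random -> noisy -> near-optimal
        print(beta, np.round(boltzmann_policy(q, beta), 3))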

2. Ill-Posed Problem

Multiple reward functions fit the same observations, and without further constraints IRL cannot distinguish between them. This is the “goal inference problem”: observing behavior under-determines the underlying reward.
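
A concrete instance of the ambiguity, sketched below on a toy chain MDP: adding any potential-based shaping term to a reward (Ng, Harada & Russell, 1999) leaves the optimal policy, and hence every demonstration, unchanged, so no amount of optimal behavior can distinguish the two reward functions.

    import numpy as np

    n_states, gamma = 5, 0.9

    def step(s, a):                          # deterministic chain: 0 = left, 1 = right
        return min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)

    def optimal_policy(reward):
        v = np.zeros(n_states)
        for _ in range(300):                 # value iteration to convergence
            v = np.array([max(reward(s, a, step(s, a)) + gamma * v[step(s, a)]
                              for a in (0, 1)) for s in range(n_states)])
        return [int(np.argmax([reward(s, a, step(s, a)) + gamma * v[step(s, a)]
                               for a in (0, 1)])) for s in range(n_states)]

    r1 = lambda s, a, s2: 1.0 if s2 == n_states - 1 else 0.0   # goal reward
    phi = lambda s: 3.0 * s                                    # arbitrary potential
    r2 = lambda s, a, s2: r1(s, a, s2) + gamma * phi(s2) - phi(s)  # shaped reward

    print(optimal_policy(r1))   # [1, 1, 1, 1, 1]: always move right
    print(optimal_policy(r2))   # same optimal policy, different reward function
    # (The all-zero reward is a third fit: under it, every policy is optimal.)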

Cooperative IRL (CIRL)

CIRL, introduced by Hadfield-Menell, Dragan, Abbeel, and Russell (2016) and central to the assistance games framework (SR2025 agenda), addresses IRL’s limitations through interactive learning.

Key Idea

Rather than having the agent imitate an inferred reward function, the human and the AI play a cooperative game: the human knows the reward function, the AI does not, and both are scored against it. The AI stays uncertain about human preferences and treats human actions as informative signals, which incentivizes the human to teach and the AI to ask, observe, and defer.
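
A minimal sketch of the belief update at the core of this idea, assuming a Boltzmann-rational human and a two-hypothesis reward space; all names and numbers here are illustrative stand-ins, not the CIRL paper’s formulation.

    import numpy as np

    hypotheses = ["coffee", "tea"]               # candidate human goals
    prior = np.array([0.5, 0.5])
    beta = 2.0                                   # assumed human rationality

    # Q(goal, action): rows = hypotheses, cols = actions
    # (0 = go-to-kettle, 1 = go-to-espresso-machine). Made-up numbers.
    Q = np.array([[0.0, 1.0],                    # coffee: espresso machine better
                  [1.0, 0.2]])                   # tea: kettle better

    def likelihood(q_row, action):               # Boltzmann-rational human model
        p = np.exp(beta * q_row)
        return (p / p.sum())[action]

    belief = prior.copy()
    for action in [1, 1]:                        # human heads to the espresso machine twice
        belief *= [likelihood(Q[h], action) for h in range(len(hypotheses))]
        belief /= belief.sum()                   # Bayesian posterior over goals
        print(dict(zip(hypotheses, np.round(belief, 3))))
    # A full CIRL agent would plan against this posterior, asking or
    # deferring while it is still uncertain rather than committing early.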

Advantages

  • Accommodates varied teaching behaviors (humans can teach in different styles)
  • Handles non-optimal human demonstrations (humans don’t have to be experts)
  • Enables systems to learn not just what to do but how and why
  • Reduces reliance on explicit reward specification

Connection to Assistance Games

CIRL formalizes a broader idea: AI as an assistant uncertain about human preferences, treating human behavior as evidence. This is the foundation of Russell’s Human Compatible framing — AI safety through deferential uncertainty rather than perfect specification.

The SR2025 assistance-games-assistive-agents agenda continues this research direction.

The Goal Inference Problem

Both IRL and CIRL face the goal inference problem: determining what agents actually want from observing their behavior. The Atlas’s framing:

Easy Goal Inference

“Finding a reasonable representation or approximation of what a human wants, given complete access to the human’s policy or behavior in any situation.”

Even this constrained version remains unsolved for complex domains.

Hard Goal Inference

For systems that should exceed human expertise:

  • Accurately imitating human errors perpetuates flawed decision-making
  • Distinguishing good decisions from mistakes requires modeling human irrationality
  • Modeling human irrationality is as complex as modeling complete behavior
  • Without accuracy as a measure, evaluating model quality becomes circular

This is why imitation-based alignment has fundamental limits — it cannot scale beyond the demonstrators.

Place in the Alignment Landscape

IRL/CIRL are partial circumventions of the specification-gaming problem:

  • ✅ No explicit reward to game (Goodhart-resistant)
  • ✅ Captures implicit human preferences
  • ❌ Bounded by demonstrator capability and alignment
  • ❌ Doesn’t scale to ASI (humans can’t demonstrate superhuman behavior)
  • ❌ Goal-inference problem remains unsolved

Used in practice within the broader alignment toolkit (combined with RLHF, Constitutional AI, oversight techniques) rather than as a standalone solution.
