AI Safety Atlas Ch.8 — Iterated Amplification

Source: Iterated Amplification

Use AI assistants to enhance human supervision beyond individual limits — then distill the amplified system into more efficient successors. Iterated Distillation and Amplification (IDA) is Paul Christiano’s foundational scalable-oversight proposal.

Amplification

Capability amplification = enhance abilities of overseers (human or AI) to solve complex tasks beyond single-overseer capacity.

Three Amplification Methods

  • Aggregation — collaborate with multiple experts; aggregated feedback creates more reliable training signals than any single expert's input (see the averaging sketch after this list)
  • Assistants — improve individual performance through AI support (e.g., an LLM reviewing medical literature for human experts)
  • Task Decomposition — break problems into solvable sub-tasks; multiple instances handle the sub-tasks in parallel, then their outputs are combined
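A minimal sketch of why aggregation helps, assuming each expert's feedback is an independent, noisy estimate of the true training signal (all numbers below are illustrative, not from the source):

```python
import random

random.seed(0)
TRUE_SIGNAL = 1.0  # the "correct" feedback value (illustrative)

def expert_feedback() -> float:
    # Each expert reports the true signal plus independent noise.
    return TRUE_SIGNAL + random.gauss(0, 0.5)

single = expert_feedback()
aggregated = sum(expert_feedback() for _ in range(25)) / 25

print(f"single-expert error:     {abs(single - TRUE_SIGNAL):.3f}")
print(f"25-expert average error: {abs(aggregated - TRUE_SIGNAL):.3f}")
# Averaging n independent judgments shrinks the noise roughly as 1/sqrt(n).
```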

Iterated Amplification

Recursive — amplified systems generate improved training signals, which train better models, which in turn form stronger amplified systems. This feedback loop allows incremental improvement without requiring a perfect initial specification, theoretically scaling oversight to arbitrarily hard tasks.

Medical research example: an LLM diagnoses complex illnesses by decomposing the problem into symptom identification → disease correlation → treatment suggestions. Medical experts review the outputs and provide feedback, and the model is retrained iteratively (sketched below).
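A control-flow sketch of that feedback loop; `diagnose`, `expert_review`, and `retrain` are hypothetical toy stubs standing in for real components, chosen only so the loop runs end to end:

```python
SUBTASKS = ["symptom identification", "disease correlation", "treatment suggestions"]

def diagnose(model, case):
    # Task decomposition: each subtask gets its own model call.
    return {sub: model(sub, case) for sub in SUBTASKS}

def expert_review(outputs):
    # Medical experts grade each subtask's output (stub: approve everything).
    return {sub: "approved" for sub in outputs}

def retrain(model, feedback):
    # A real system would update weights on the corrected outputs;
    # the stub returns the model unchanged.
    return model

model = lambda subtask, case: f"draft output for {subtask}"
for _ in range(3):  # each round's reviewed outputs become the next training signal
    feedback = expert_review(diagnose(model, case="anonymized patient record"))
    model = retrain(model, feedback)
```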

Reliability Amplification

Reduce AI failure rates through:

  • Redundant Systems — multiple independent AI instances on identical tasks
  • Majority Voting — select the answer most instances agree on (see the voting sketch after this list)
  • Error Checking — cross-validating using alternative methods
  • Iterative Improvement — continuously refining based on failure cases
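A sketch of redundancy plus majority voting, assuming instances fail independently; the lambda "instances" are toy stand-ins for separately sampled models:

```python
from collections import Counter

def majority_vote(instances, task):
    # Run independent instances on the identical task; return the modal answer.
    answers = [model(task) for model in instances]
    return Counter(answers).most_common(1)[0][0]

# Toy stand-ins: two correct instances mask one faulty one.
instances = [
    lambda task: task.upper(),
    lambda task: task.upper(),
    lambda task: "garbage",
]
print(majority_vote(instances, "ok"))  # -> "OK"
```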

Security Amplification

Make systems robust against adversarial inputs by shrinking the set of “bad inputs” on which the system misbehaves, so that finding an exploit becomes exponentially harder.
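Back-of-the-envelope arithmetic for the exponential claim, assuming each independent screening layer passes a given bad input with the same probability (the 0.05 figure is made up for illustration):

```python
eps = 0.05  # chance a single layer lets a bad input through (illustrative)
for k in (1, 2, 4, 8):
    # With independent layers, survival probability is eps ** k.
    print(f"{k} layer(s) -> exploit survives with probability {eps ** k:.1e}")
```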

Distillation

Amplification alone has limits:

  • Complexity obscures whether decisions are being made safely
  • Significant computational requirements
  • Coordination inefficiencies across components

Distillation = compress large amplified systems into smaller, efficient versions while preserving capabilities.

The “teacher” (complex amplified system) → “student” (simpler distilled model):

  1. Teacher trains on the dataset
  2. Teacher generates soft targets (output probability distributions)
  3. Student trains to mimic the teacher’s soft targets
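A minimal sketch of steps 2-3 using the standard soft-target loss from knowledge distillation (Hinton et al., 2015); IDA's distillation step is analogous, with the amplified system playing the teacher. PyTorch is assumed here purely for illustration.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Soften both output distributions with a temperature, then train the
    # student to match the teacher's soft targets via KL divergence.
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    # The temperature**2 factor keeps gradient magnitudes comparable
    # across different temperature settings.
    return F.kl_div(student_log_probs, soft_targets,
                    reduction="batchmean") * temperature ** 2

# Usage: loss = distillation_loss(student(x), teacher(x).detach())
```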

IDA — Iterated Distillation and Amplification

IDA combines amplification + distillation in continuous loops to “generate progressively better training signals using amplified models for tasks that are hard to evaluate directly,” addressing specification or outer alignment challenges.

Five-Step IDA Process

  1. Initial Model Training — baseline capabilities (supervised, imitation, or RL)
  2. Amplification — enhance via multiple copies, tools, or other techniques
  3. Task Decomposition — break into sub-tasks solved in parallel
  4. Distillation — train simplified models to imitate amplified systems while preserving alignment
  5. Iteration — repeat to refine capability and alignment
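A self-contained toy rendering of the five steps on a trivial task (summing a list). Every component is a stand-in chosen only to make the control flow runnable; real IDA would use learned models and human overseers.

```python
from typing import Callable, List

Task = List[int]
Model = Callable[[Task], int]

def base_model(task: Task) -> int:
    # Step 1: a weak baseline that is only reliable on tiny inputs.
    return sum(task[:2])

def amplify(model: Model) -> Model:
    # Steps 2-3: many calls to the current model, plus task
    # decomposition, outperform a single direct call.
    def amplified(task: Task) -> int:
        if len(task) <= 2:
            return model(task)
        mid = len(task) // 2
        return amplified(task[:mid]) + amplified(task[mid:])
    return amplified

def distill(teacher: Model) -> Model:
    # Step 4: a real student would be a smaller network trained on the
    # teacher's outputs; this stub imitates by caching teacher answers.
    cache: dict = {}
    def student(task: Task) -> int:
        key = tuple(task)
        if key not in cache:
            cache[key] = teacher(task)
        return cache[key]
    return student

model: Model = base_model
for _ in range(3):                 # Step 5: iterate the loop
    model = distill(amplify(model))

print(model([1, 2, 3, 4, 5]))      # -> 15
```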

Limitations and Criticisms

  • Distillation and amplification must both preserve alignment — either step can drift
  • Success depends on human overseer capabilities — initial overseer limits cap entire IDA process
  • Computational intensity — scaling challenges
  • Distilled models may lose amplified capabilities
  • Misalignment errors can accumulate across iterations
  • Not all tasks decompose into simpler sub-tasks

Connection to Wiki

This subchapter substantially expands the wiki’s iterative-amplification concept page: