AI Safety Atlas Ch.8 — Iterated Amplification
Source: Iterated Amplification
Use AI assistants to enhance human supervision beyond individual limits — then distill the amplified system into more efficient successors. Iterated Distillation and Amplification (IDA) is Paul Christiano’s foundational scalable-oversight proposal.
Amplification
Capability amplification = enhance abilities of overseers (human or AI) to solve complex tasks beyond single-overseer capacity.
Three Amplification Methods
- Aggregation — combine judgments from multiple experts; aggregated feedback produces more reliable training signals than any single individual's input
- Assistants — improve individual performance through AI support (e.g., LLM reviewing medical literature for human experts)
- Task Decomposition — break problems into solvable sub-tasks; multiple instances handle subtasks in parallel, then combine outputs
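The task-decomposition method above can be sketched in a few lines. This is a minimal toy, assuming a hypothetical `weak_model` that can only answer single-step arithmetic questions reliably; the amplified overseer splits a harder nested task into sub-tasks, delegates each to a model instance, and combines the results.

```python
def weak_model(question):
    """Stand-in for a model that handles only single-step arithmetic."""
    a, op, b = question
    return a + b if op == "+" else a * b

def amplified_solve(expression):
    """Decompose a nested expression into sub-tasks for weak_model.

    `expression` is either a number or a tuple (left, op, right),
    where left/right may themselves be nested expressions.
    """
    if isinstance(expression, (int, float)):
        return expression
    left, op, right = expression
    # Each sub-task is simple enough for the weak model; recursion
    # plays the role of spawning parallel model instances.
    sub_left = amplified_solve(left)
    sub_right = amplified_solve(right)
    return weak_model((sub_left, op, sub_right))

# A task too nested for one weak_model call, solved via decomposition:
result = amplified_solve(((2, "+", 3), "*", (4, "+", 1)))
print(result)  # (2 + 3) * (4 + 1) = 25
```

No single `weak_model` call could evaluate the full expression, but the composition of many simple calls can; that gap between the base model and its decomposed composition is the amplification.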
Iterated Amplification
Recursive — amplified systems generate improved training signals, creating a feedback loop. Allows incremental improvements without requiring perfect initial specifications, theoretically scaling oversight to any task.
Medical research example: LLM diagnoses complex illnesses by decomposing into symptom identification → disease correlation → treatment suggestions. Medical experts review outputs and provide feedback, retraining the model iteratively.
Reliability Amplification
Reduce AI failure rates through:
- Redundant Systems — multiple independent AI instances on identical tasks
- Majority Voting — comparing outputs across multiple models
- Error Checking — cross-validating using alternative methods
- Iterative Improvement — continuously refining based on failure cases
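The redundancy-plus-voting idea can be illustrated with a toy simulation. This sketch assumes independent failures across instances, which is a strong assumption in practice since model copies share training data and biases; `unreliable_model` is a made-up stand-in, not any real system.

```python
import random
from collections import Counter

def unreliable_model(x, p_fail=0.2, rng=random):
    """Stand-in model: returns the right answer (x * 2) most of the
    time, and a wrong one with probability p_fail."""
    return x * 2 if rng.random() > p_fail else x * 2 + 1

def majority_vote(x, n_instances=5):
    """Run several independent instances and take the modal output."""
    outputs = [unreliable_model(x) for _ in range(n_instances)]
    answer, _ = Counter(outputs).most_common(1)[0]
    return answer

random.seed(0)
singles = [unreliable_model(7) == 14 for _ in range(1000)]
votes = [majority_vote(7) == 14 for _ in range(1000)]
print(f"single-instance accuracy: {sum(singles) / 1000:.2f}")
print(f"majority-vote accuracy:   {sum(votes) / 1000:.2f}")
```

With independent 20% failures, a 5-way vote fails only when 3 or more instances fail simultaneously (probability about 6%), so the voted system is markedly more reliable than any single instance.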
Security Amplification
Make systems robust against adversarial inputs by shrinking the set of inputs that trigger bad behavior at each iteration, so that finding an exploit becomes progressively harder.
Distillation
Amplification alone has limits:
- System complexity makes it hard to verify that decisions are safe
- Significant computational requirements
- Coordination inefficiencies across components
Distillation = compress large amplified systems into smaller, efficient versions while preserving capabilities.
The “teacher” (the complex amplified system) trains the “student” (a simpler distilled model):
- The teacher is run on a dataset
- Its output probabilities are recorded as soft targets
- The student is trained to mimic those soft targets
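The soft-target step above can be made concrete with a schematic example. The "models" here are just hand-written logit vectors over three classes (an illustration, not real networks); the key point is that the student's loss is cross-entropy against the teacher's softened output distribution rather than against hard labels.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; higher temperature softens them."""
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(soft_targets, student_probs):
    """Loss pushing the student's distribution toward the teacher's."""
    return -sum(t * math.log(s) for t, s in zip(soft_targets, student_probs))

teacher_logits = [2.0, 1.0, 0.1]   # large amplified "teacher"
soft_targets = softmax(teacher_logits, temperature=2.0)

student_logits = [0.5, 0.4, 0.3]   # small "student" before training
loss_before = cross_entropy(soft_targets, softmax(student_logits, 2.0))
# The loss is minimized exactly when the student matches the teacher:
loss_after = cross_entropy(soft_targets, softmax(teacher_logits, 2.0))
print(f"student loss before training: {loss_before:.3f}")
print(f"loss if student matches teacher: {loss_after:.3f}")
```

Soft targets carry more information per example than hard labels (e.g. how much more likely class 1 is than class 2), which is what lets a small student approximate a much larger teacher.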
IDA — Iterated Distillation and Amplification
IDA combines amplification + distillation in continuous loops to “generate progressively better training signals using amplified models for tasks that are hard to evaluate directly,” addressing specification or outer alignment challenges.
Five-Step IDA Process
- Initial Model Training — baseline capabilities (supervised, imitation, or RL)
- Amplification — enhance via multiple copies, tools, or other techniques
- Task Decomposition — break into sub-tasks solved in parallel
- Distillation — train simplified models to imitate amplified systems while preserving alignment
- Iteration — repeat to refine capability and alignment
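The five steps can be sketched as one loop. This is a deliberately reductive toy in which each model is a single capability score in [0, 1] rather than a trained network; the assumed numbers (amplification gain, distillation loss) are illustrative only.

```python
def amplify(capability, n_copies=4):
    """Steps 2-3: multiple copies plus task decomposition let the
    amplified system outperform any single copy (capped at 1.0)."""
    return min(1.0, capability + 0.05 * n_copies)

def distill(amplified, efficiency_loss=0.02):
    """Step 4: compress the amplified system into a cheaper student,
    losing a little capability in the process."""
    return max(0.0, amplified - efficiency_loss)

capability = 0.3  # Step 1: baseline model
history = [capability]
for _ in range(6):  # Step 5: iterate amplify -> distill
    capability = distill(amplify(capability))
    history.append(capability)

print([round(c, 2) for c in history])
```

Capability climbs until the amplification gain merely offsets the distillation loss, which mirrors a limitation listed below: each distillation step can shed some of what amplification gained, and whether alignment survives each round is exactly what the toy cannot show.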
Limitations and Criticisms
- Distillation and amplification must both preserve alignment — either step can drift
- Success depends on human overseer capabilities — initial overseer limits cap entire IDA process
- Computational intensity — scaling challenges
- Distilled models may lose amplified capabilities
- Misalignment errors can accumulate across iterations
- Not all tasks decompose into simpler sub-tasks
Connection to Wiki
This subchapter substantially expands the wiki’s iterative-amplification concept page:
- iterative-amplification — the canonical concept
- paul-christiano — IDA originator
- task-decomposition — IDA dependency
- scalable-oversight — parent
- 80k-podcast-paul-christiano — extended treatment
- supervising-ais-improving-ais — SR2025 agenda
- weak-to-strong-generalization — adjacent paradigm