Superalignment
Definition
Superalignment is the research program — and broader strategic frame — for aligning AI systems substantially more capable than humans. The term was popularized by OpenAI’s Superalignment team, launched in July 2023 and co-led by Jan Leike and Ilya Sutskever. OpenAI committed 20% of secured compute over four years to the effort (OpenAI 2023, Introducing Superalignment).
The team’s strategic bet: use AI systems to automate alignment research, bootstrapping increasingly aligned successors from the current generation of approximately aligned models (OpenAI 2023; 80,000 Hours: Jan Leike on Superalignment).
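A toy way to see the bootstrap’s logic (the update rule and all numbers below are illustrative assumptions, not claims from OpenAI’s plan): treat alignment as a 0–1 level, and assume a more-aligned assistant closes a larger fraction of the remaining gap for its successor.

```python
# Toy model of the Superalignment bootstrap. Each generation of
# approximately-aligned models assists alignment research on its successor.
# The update rule and the numbers are illustrative assumptions only.

def bootstrap(seed: float, gain: float, generations: int) -> list[float]:
    """Alignment level per generation (0..1 scale), assuming a more-aligned
    assistant closes a larger fraction of the remaining gap to 1.0."""
    levels = [seed]
    for _ in range(generations):
        a = levels[-1]
        levels.append(min(1.0, a + gain * a * (1.0 - a)))
    return levels

if __name__ == "__main__":
    # A reasonably aligned seed climbs quickly toward 1.0 ...
    print([round(a, 3) for a in bootstrap(seed=0.8, gain=0.5, generations=5)])
    # ... a weak seed climbs slowly, leaving more generations in which
    # errors could compound (see Open questions below).
    print([round(a, 3) for a in bootstrap(seed=0.2, gain=0.5, generations=5)])
```

This toy model makes the strategy’s dependence on the seed explicit: the bet rides on current models being aligned enough that assisted research improves, rather than degrades, each successor.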
Three technical pillars (OpenAI 2023; Atlas Ch.3 — ASI Safety Strategies; see atlas-ch3-strategies-05-asi-safety-strategies):
- Scalable oversight — methods for humans (possibly AI-assisted) to evaluate the work of systems more capable than any individual human; a minimal sketch follows this list.
- Generalization — ensuring alignment properties learned during training transfer to novel inputs.
- Mechanistic interpretability — verifying alignment via internal inspection rather than behavior alone.
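Of the three pillars, scalable oversight is the most protocol-like. The sketch below shows one common pattern, critique-assisted evaluation, in which a weaker judge oversees a stronger worker with help from an AI critic; `query_model` and the model names are hypothetical placeholders, not a real API.

```python
# Sketch of critique-assisted evaluation, one scalable-oversight pattern.
# query_model and all model names are hypothetical placeholders.

def query_model(model: str, prompt: str) -> str:
    # Stand-in so the sketch runs; replace with a real model API call.
    return f"[{model} response to: {prompt[:40]}...]"

def assisted_evaluation(task: str) -> dict:
    answer = query_model("strong-worker", task)
    # The critic surfaces flaws a weaker (e.g., human) judge might miss.
    critique = query_model(
        "critic",
        f"Task: {task}\nAnswer: {answer}\nList errors or unsupported claims.",
    )
    # The judge sees both answer and critique, so its verdict can exceed
    # what it could verify unaided.
    verdict = query_model(
        "judge",
        f"Task: {task}\nAnswer: {answer}\nCritique: {critique}\nAccept? yes/no",
    )
    return {"answer": answer, "critique": critique, "verdict": verdict}

if __name__ == "__main__":
    print(assisted_evaluation("Prove that the sum of two even numbers is even."))
```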
Why it matters
Superalignment is the research program that explicitly addresses the limit case of alignment: systems beyond human evaluation capability. Most current alignment techniques (rlhf, constitutional-ai) implicitly assume a roughly human-level evaluator. Superalignment asks what happens when that assumption breaks (80,000 Hours: Jan Leike; Atlas Ch.3 — ASI Safety Strategies).
Two reasons it’s load-bearing for the field:
- The three-pillar agenda has shaped subsequent research. Even after OpenAI dissolved the team in 2024, the scalable-oversight + generalization + interpretability decomposition remains the standard framing for alignment-research-program design. Most lab safety teams now structure work along similar lines (Atlas Ch.3).
- The institutional case study. Superalignment’s trajectory — large compute commitment, public roadmap, then dissolution and Leike’s departure — is widely cited as a cautionary tale about the tension between commercial pressure and safety investment at frontier labs. This is now itself an ai-governance data point (Leike departure tweet thread; Atlas Ch.3).
Key results
- The original four-year program (OpenAI 2023). Public commitment to dedicate 20% of secured compute to alignment, with the explicit goal of solving the technical problem of aligning superhuman AI in four years. The first major lab-level resourcing commitment specifically to the superhuman-alignment regime.
- Weak-to-strong generalization (Burns et al. 2023, Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision). The team’s first major technical output: an empirical methodology for studying the superhuman-alignment problem now, by using small models to supervise larger ones. Demonstrates that strong models can be elicited above the supervisor’s accuracy ceiling — making the bootstrapping argument empirically tractable; a toy sketch of the protocol follows this list. See weak-to-strong-generalization.
- The “automate alignment research” thesis. Leike’s central argument is that human-only alignment research cannot keep pace with capability research; the only way to close the gap is to use AI systems to do alignment research at scale (80,000 Hours: Jan Leike). This thesis is contested but has become the dominant frame for safety planning at major labs.
- The team’s dissolution (May 2024). Following Leike’s resignation and public statement that “safety culture and processes have taken a backseat to shiny products” (Leike’s tweet thread), the Superalignment team was dissolved and its work distributed across remaining safety teams. Leike subsequently joined Anthropic. The episode is now a standard reference in lab-governance discussions.
- The agenda survived institutional dissolution. Most major frontier labs now structure their safety research along the three-pillar pattern — interpretability, scalable oversight, alignment evaluation — even without using the “Superalignment” label. Anthropic’s safety team structure, DeepMind’s amplified-oversight + interpretability split, and OpenAI’s post-dissolution Alignment work all reflect this pattern (Atlas Ch.3 — ASI Safety Strategies).
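A minimal, runnable illustration of the weak-to-strong protocol referenced in the list above. For simplicity, the weak supervisor is simulated here as ground truth corrupted by symmetric random label noise; in Burns et al. it is an actual small model whose errors are more structured, so treat the numbers as illustrative only. The PGR (performance gap recovered) formula follows the paper.

```python
# Toy weak-to-strong generalization experiment (after Burns et al. 2023).
# The weak supervisor is simulated as 25% random label noise, an assumption
# made here so the example stays small and self-contained.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=8000, n_features=20, random_state=0)
X_train, y_train = X[:4000], y[:4000]
X_test, y_test = X[4000:], y[4000:]

# Weak supervision: flip 25% of the training labels at random.
flip = rng.random(len(y_train)) < 0.25
weak_labels = np.where(flip, 1 - y_train, y_train)
weak_acc = float(np.mean(weak_labels == y_train))  # supervisor's ceiling, ~0.75

# Strong student trained on the weak labels. Symmetric noise does not move
# the optimal decision boundary, so the student can exceed its supervisor.
w2s_acc = LogisticRegression(max_iter=1000).fit(X_train, weak_labels).score(X_test, y_test)

# Strong ceiling: the same student trained on clean labels.
ceiling_acc = LogisticRegression(max_iter=1000).fit(X_train, y_train).score(X_test, y_test)

# Performance gap recovered (PGR), as defined in the paper.
pgr = (w2s_acc - weak_acc) / (ceiling_acc - weak_acc)
print(f"weak={weak_acc:.3f} weak-to-strong={w2s_acc:.3f} "
      f"ceiling={ceiling_acc:.3f} PGR={pgr:.2f}")
```

In this toy setup PGR lands well above zero, mirroring the paper’s qualitative finding that the student recovers much of the gap between weak supervisor and strong ceiling.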
Open questions
- Can alignment research actually be automated at the rate the thesis requires? The bet that AI-assisted alignment research closes the gap faster than capability research widens it is empirically untested. Whether AI systems can usefully contribute to alignment research before they themselves require alignment is the central unresolved question (80,000 Hours: Jan Leike).
- Does the bootstrap actually work without a “fully aligned” seed? The strategy uses approximately-aligned current models to build more-aligned successors. If the seed alignment is too weak, errors compound across iterations rather than shrinking (Burns et al. 2023, §6); a toy calculation follows this list.
- Are interpretability tools mature enough to verify superhuman alignment? Mechanistic interpretability has shown progress at frontier-model scale (Templeton et al. 2024, Scaling Monosemanticity), but reliability against an adversarial superhuman system is unestablished (Atlas Ch.3).
- Can voluntary lab commitments survive competitive pressure? The Superalignment dissolution is a data point suggesting they may not. Whether frontier-lab safety investment can be insulated from product timelines is partly an ai-governance question (Leike departure thread).
- How does Superalignment interact with control? Superalignment aims at alignment of superhuman systems; control assumes alignment may fail and bounds the consequences. Whether the two are complementary or competing for resources is contested in lab-strategy discussions (Atlas Ch.3).
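An illustrative toy calculation (an assumption for this page, not a result from Burns et al.) of why the seed question matters: if each bootstrap step preserves only a fixed fraction f < 1 of the previous generation’s alignment, the loss compounds geometrically.

```latex
% Toy compounding model (illustrative assumption): alignment after n steps.
a_n = a_0 \, f^{\,n},
\qquad \text{e.g. } a_0 = 0.9,\; f = 0.95,\; n = 10
\;\Rightarrow\; a_{10} = 0.9 \times 0.95^{10} \approx 0.54 .
```

Conversely, any per-step fidelity f ≥ 1 keeps a_n ≥ a_0; the open questions above are, in effect, about whether AI-assisted research can push f above 1 and keep it there.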
Related agendas
- supervising-ais-improving-ais — the SR2025 agenda for AI-assisted alignment research; the natural extension of the Superalignment thesis.
- black-box-make-ai-solve-it — broader bucket for “use AI to solve alignment” approaches.
- reverse-engineering, sparse-coding — interpretability agendas instantiating Pillar 3.
- chain-of-thought-monitoring — read-only monitoring layered on top of superalignment-trained systems.
Related concepts
- scalable-oversight — Pillar 1; the core technical problem.
- iterative-amplification — foundational scalable-oversight proposal Superalignment built on.
- weak-to-strong-generalization — empirical methodology for studying Pillar 2.
- interpretability — Pillar 3; verification rather than behavior-only evaluation.
- mechanistic-interpretability — the narrower technical sub-field that supports Pillar 3.
- rlhf — what Superalignment is designed to scale beyond.
- ai-alignment — the parent problem.
- deceptive-alignment — the failure mode Superalignment must address.
- ai-control — operational complement when alignment cannot be guaranteed.
- asi-safety-strategies — the broader strategy category.
- guaranteed-safe-ai — formal-methods alternative for the same problem class.
- ai-takeover-scenarios — what alignment failure looks like at superhuman scale.
Related Pages
- ai-alignment
- scalable-oversight
- interpretability
- mechanistic-interpretability
- iterative-amplification
- weak-to-strong-generalization
- rlhf
- deceptive-alignment
- ai-control
- ai-takeover-scenarios
- asi-safety-strategies
- guaranteed-safe-ai
- scientist-ai
- ai-safety
- responsible-scaling-policy
- scaling-laws
- transformative-ai
- intelligence-explosion
- supervising-ais-improving-ais
- black-box-make-ai-solve-it
- reverse-engineering
- sparse-coding
- chain-of-thought-monitoring
- jan-leike
- leopold-aschenbrenner
- openai
- anthropic
- lawzero
- 80k-podcast-jan-leike-superalignment
- atlas-ch3-strategies-05-asi-safety-strategies
- ai-safety-atlas-textbook
Sources cited
Primary URLs harvested from this page’s summary references. Auto-generated by scripts/backfill_citations.py; edit by re-running, not by hand.
- AI Safety Atlas Ch.3 — ASI Safety Strategies — referenced as [[atlas-ch3-strategies-05-asi-safety-strategies]]
- 80,000 Hours Podcast — Jan Leike on Superalignment — referenced as [[80k-podcast-jan-leike-superalignment]]