Summary: 80,000 Hours Podcast — Jan Leike on Superalignment

Overview

In episode #159 of the 80,000 Hours Podcast, Jan Leike, then head of alignment at OpenAI and co-leader of the Superalignment project, details OpenAI's plan to solve the alignment problem for superintelligent AI systems. OpenAI committed 20% of the compute it had secured to the effort, aiming to solve the core technical challenges within four years. Leike explains the technical research agenda, why he believes alignment is tractable, and the specific risks that make the problem urgent.

The Superalignment Project

OpenAI’s Superalignment project represents one of the largest institutional commitments to alignment research. Its core strategy is to automate alignment research using AI itself — building AI systems that can help solve the alignment problem for even more powerful successors. The project focuses on three technical pillars:

1. Mechanistic Interpretability

Understanding the internal workings of neural networks — what individual neurons and circuits do, how information flows through models, and why models make specific decisions. This is essential for verifying that a system is genuinely aligned rather than merely appearing aligned.
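Work in this area typically starts by instrumenting a network to see what its internal units actually compute. The sketch below is a minimal illustration, not OpenAI's methodology: it builds a throwaway two-layer PyTorch network, attaches a forward hook to capture a hidden layer's activations, and reports which units respond most strongly to a given input.

```python
# Minimal interpretability probe: record a layer's activations with a forward
# hook and look at which units respond most to an input.
# The two-layer model here is a stand-in, not any real production system.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))

activations = {}

def save_activation(name):
    def hook(module, inputs, output):
        activations[name] = output.detach()
    return hook

# Attach the hook to the hidden ReLU layer.
model[1].register_forward_hook(save_activation("hidden"))

x = torch.randn(1, 16)  # stand-in for a real input
model(x)

hidden = activations["hidden"][0]
top_units = torch.topk(hidden, k=5)
print("Most active hidden units:", top_units.indices.tolist())
print("Their activation values: ", top_units.values.tolist())
```

Real mechanistic interpretability goes much further (tracing circuits, attributing behavior to components), but recording and inspecting internal activations like this is the basic starting move.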

2. Generalization

Ensuring that alignment properties learned during training transfer reliably to new situations. A model that behaves well on training data but unpredictably on novel inputs is not truly aligned, even if it passes all existing evaluations.
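A toy way to see the failure mode described above is to train a small classifier on one input distribution and evaluate it on a shifted one. Everything in the sketch below (the synthetic Gaussian data, the tiny network, the `shift` parameter) is an illustrative assumption about distribution shift in general, not about alignment training specifically.

```python
# Toy illustration of the generalization concern: a model that looks fine on
# inputs drawn from its training distribution can behave very differently on
# shifted, "novel" inputs. Synthetic data and tiny classifier for illustration only.
import torch
import torch.nn as nn

torch.manual_seed(0)

def make_data(n, shift=0.0):
    # `shift` moves the inputs, and the labelling rule moves with them,
    # so the boundary learned at training time no longer applies.
    x = torch.randn(n, 2) + shift
    y = (x[:, 0] + x[:, 1] > 2 * shift).long()
    return x, y

model = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

x_train, y_train = make_data(512)
for _ in range(200):
    opt.zero_grad()
    loss_fn(model(x_train), y_train).backward()
    opt.step()

def accuracy(x, y):
    with torch.no_grad():
        return (model(x).argmax(dim=1) == y).float().mean().item()

x_in, y_in = make_data(256)               # same distribution as training
x_out, y_out = make_data(256, shift=3.0)  # shifted, novel inputs
print(f"in-distribution accuracy:     {accuracy(x_in, y_in):.2f}")
print(f"out-of-distribution accuracy: {accuracy(x_out, y_out):.2f}")
```

The gap between the two numbers is the point: passing the evaluations you already have says little about behavior on inputs you have not yet seen.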

3. Scalable Oversight

Developing methods for humans (possibly assisted by AI) to effectively supervise systems that are more capable than any individual human. This connects directly to Paul Christiano's work on iterated amplification and is perhaps the most fundamental challenge: how do you evaluate the quality of work done by something smarter than you?
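One concrete flavor of this idea, in the spirit of iterated amplification, is task decomposition: rather than judging one large, opaque answer, the overseer has the model break the task into pieces small enough to verify, then combines the checked pieces. The sketch below is purely schematic; `ask_model` and `overseer_can_check` are hypothetical placeholders, not a real API or OpenAI's actual procedure.

```python
# Schematic decomposition loop behind scalable oversight / iterated amplification:
# break a hard task into checkable subtasks, answer each, then combine.

def ask_model(prompt: str) -> str:
    """Hypothetical model call; a real implementation would query an assistant."""
    return f"<model answer to: {prompt!r}>"

def overseer_can_check(subtask: str) -> bool:
    """Crude stand-in: treat short subtasks as simple enough to verify directly."""
    return len(subtask) < 80

def amplified_answer(task: str, depth: int = 0, max_depth: int = 3) -> str:
    # If the piece is checkable (or we hit the recursion limit), answer it directly.
    if overseer_can_check(task) or depth >= max_depth:
        return ask_model(task)
    # Otherwise ask the model to propose subtasks, answer each recursively,
    # and combine the checked pieces into an overall answer.
    subtasks = ask_model(f"List subtasks, separated by ';', for: {task}").split(";")
    partial_answers = [amplified_answer(s.strip(), depth + 1, max_depth) for s in subtasks]
    return ask_model(f"Combine these checked pieces into an answer for {task!r}: {partial_answers}")

print(amplified_answer("Evaluate whether this 2,000-line refactoring patch is safe to merge into main"))
```

The open question, of course, is whether hard evaluation tasks actually decompose into pieces a weaker overseer can check, which is why scalable oversight remains a research problem rather than a solved recipe.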

Why Alignment Is Tractable

Leike expresses cautious optimism about alignment, arguing that several recent developments are favorable:

  • Large language models understand natural language — This is enormously helpful because it means AI systems can comprehend human instructions, values, and feedback in the same medium humans use to communicate. Compared to earlier paradigms (where translating human intent into machine-readable objectives was a major bottleneck), LLMs provide a much more natural interface.

  • LLMs “know so much about humans” — Having been trained on vast quantities of human text, language models have extensive implicit knowledge of human values, norms, and expectations, which provides a foundation for alignment.

  • Alignment is scientifically interesting — Leike argues this is important because it means talented researchers are attracted to the field, and the problems are intellectually rich enough to sustain long-term research programs.

RLHF Limitations

A critical point in the episode is that current alignment techniques, particularly Reinforcement Learning from Human Feedback (RLHF), are insufficient for superintelligent AI. The core problem: a superintelligent system “could think of all kinds of ways to subtly subvert us, or try to deceive us or lie to us in a way that is really difficult for us to check.”

RLHF works when human evaluators can reliably assess whether model outputs are good. But with sufficiently capable systems, the model may produce outputs that look excellent to human evaluators while actually pursuing misaligned objectives. The evaluators would not be able to tell the difference — this is the fundamental scalable-oversight challenge.
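To see why evaluator judgment is the binding constraint, consider the reward-modelling step at the heart of RLHF: a reward model is fit to pairwise human preference labels, and downstream fine-tuning optimizes against that learned reward. The sketch below uses toy features and a linear reward model purely for illustration; the pairwise loss shown is the standard formulation, but nothing here reproduces an actual production pipeline.

```python
# Minimal sketch of RLHF's reward-modelling step: fit a reward model to human
# preference labels with the standard pairwise (Bradley-Terry) loss.
# The signal comes entirely from human comparisons, so the learned reward is
# only as reliable as the evaluators' judgments.
import torch
import torch.nn as nn
import torch.nn.functional as F

reward_model = nn.Linear(8, 1)  # stand-in for a large network
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-2)

# Pretend features for pairs of responses; the labels say which one the human
# preferred. If the human cannot tell good from subtly deceptive outputs,
# these labels (and hence the reward model) inherit that blindness.
chosen = torch.randn(64, 8)    # features of preferred responses
rejected = torch.randn(64, 8)  # features of rejected responses

for step in range(100):
    r_chosen = reward_model(chosen)
    r_rejected = reward_model(rejected)
    # -log sigmoid(r_chosen - r_rejected): push preferred outputs above rejected ones.
    loss = -F.logsigmoid(r_chosen - r_rejected).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(f"final pairwise loss: {loss.item():.3f}")
```

If the human comparisons cannot distinguish genuinely good outputs from subtly deceptive ones, that blind spot is baked into the reward model and then amplified by whatever policy is trained against it.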

Automating Alignment Research

The Superalignment project’s ultimate goal is to reach a point where AI systems can productively contribute to alignment research itself. Leike is careful to distinguish this from simply training “super-smart ML researchers”: the goal is specifically to build systems that help with the alignment problem, rather than to advance AI capabilities research in general.

This creates an interesting bootstrapping challenge: you need somewhat aligned AI to help build more aligned AI. The project’s bet is that current AI systems are good enough (and aligned enough) to provide useful assistance on alignment research, even though they are not yet fully trustworthy.

Significance

This episode captures a pivotal moment in AI safety history: the announcement and early execution of one of the largest institutional alignment research efforts to date. The Superalignment project’s subsequent trajectory, including Leike’s eventual departure from OpenAI, makes this episode valuable as both a technical overview and a historical document.

The three-pillar research agenda (interpretability, generalization, scalable oversight) continues to define the field’s major research directions, and the question of whether AI can be used to solve its own alignment problem remains central to every major safety strategy.