Jan Leike

Jan Leike is an AI alignment researcher who served as head of alignment at OpenAI and co-leader of the Superalignment project, one of the largest institutional commitments to alignment research in the field's history. OpenAI committed 20% of its compute to the project over a four-year period. Leike later departed OpenAI, making his tenure a significant chapter in the evolving relationship between frontier AI labs and safety research.

The Superalignment Project

Under Leike’s co-leadership, the Superalignment project pursued a strategy of automating alignment research using AI itself — building AI systems that can help solve the alignment problem for even more powerful successors. The project organized around three technical pillars:

  1. Mechanistic interpretability — Understanding the internal workings of neural networks to verify that systems are genuinely aligned rather than merely appearing aligned.
  2. Generalization — Ensuring that alignment properties learned during training transfer reliably to new situations.
  3. Scalable oversight — Developing methods for humans (possibly assisted by AI) to supervise systems more capable than any individual human, building on Paul Christiano's work on iterated amplification (a toy sketch follows this list).
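
The third pillar references iterated amplification, and a toy sketch can make its recursive structure concrete. This is an illustration under stated assumptions, not Christiano's or OpenAI's implementation: the functions weak_assistant, decompose, and amplify are hypothetical stand-ins.

```python
# Toy sketch of the iterated-amplification idea referenced above. A limited
# but trusted assistant answers easy sub-questions, and the overseer combines
# those answers to address questions it could not evaluate in one step.
# All function names are illustrative, not from any published codebase.

def weak_assistant(question: str) -> str:
    """Stand-in for a model the overseer already trusts on simple questions."""
    return f"answer({question})"

def decompose(question: str) -> list[str]:
    """Stand-in for the overseer splitting a hard question into easier parts."""
    return [f"{question} :: part {i}" for i in (1, 2)]

def amplify(question: str, depth: int = 2) -> str:
    """Recursively answer a question by combining trusted answers to its parts.

    At depth 0 the question is assumed simple enough for the weak assistant;
    above that, the overseer's judgment is amplified by delegation.
    """
    if depth == 0:
        return weak_assistant(question)
    sub_answers = [amplify(sub, depth - 1) for sub in decompose(question)]
    return "combine(" + ", ".join(sub_answers) + ")"

if __name__ == "__main__":
    print(amplify("Is this model's training objective what we intended?"))
```

In Christiano's full proposal the amplified behavior is then distilled back into a single, faster model and the cycle repeats; the sketch omits that distillation step.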

Views on Alignment Tractability

Leike expressed cautious optimism about alignment, arguing that large language models’ ability to understand natural language is enormously helpful — it means AI systems can comprehend human instructions and values in the same medium humans use to communicate. He noted that LLMs have extensive implicit knowledge of human values and norms from training on vast quantities of human text.

However, Leike was clear that current techniques, particularly RLHF (reinforcement learning from human feedback), are insufficient for superintelligent AI. A superintelligent system “could think of all kinds of ways to subtly subvert us, or try to deceive us or lie to us in a way that is really difficult for us to check.”
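
The limitation Leike describes is visible in the shape of the RLHF loop itself: the reward signal comes from human preference judgments, so behavior the human judge cannot evaluate, or is deceived about, never enters the training signal. Below is a minimal sketch of that loop with hypothetical names throughout; it is not any lab's implementation.

```python
# Minimal, hedged sketch of an RLHF-style loop. The structural point: the
# reward model is fit to human comparisons, so the policy is optimized against
# a proxy that only reflects what the human judge could actually check.

from typing import Callable

def collect_comparison(prompt: str, response_a: str, response_b: str,
                       human_judge: Callable[[str, str, str], bool]) -> tuple[str, str]:
    """Ask the human judge which response is better; return (preferred, rejected)."""
    if human_judge(prompt, response_a, response_b):
        return response_a, response_b
    return response_b, response_a

def train_reward_model(comparisons: list[tuple[str, str]]) -> Callable[[str], float]:
    """Stand-in for fitting a reward model to preference data.

    Here it simply memorizes which responses were preferred; a real reward
    model generalizes, but it still only encodes distinctions humans made.
    """
    preferred = {chosen for chosen, _ in comparisons}
    return lambda response: 1.0 if response in preferred else 0.0

def policy_step(candidates: list[str], reward: Callable[[str], float]) -> str:
    """Stand-in for policy optimization: select the highest-reward candidate.

    A system more capable than its judges can score well here while behaving
    in ways the judges would not endorse if they could see through it.
    """
    return max(candidates, key=reward)
```

The scalable oversight pillar described above targets exactly this bottleneck: assisting the human judgment step with AI so that the comparisons stay meaningful as systems become more capable than their evaluators.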

Significance

The Superalignment project’s three-pillar research agenda (interpretability, generalization, scalable oversight) continues to define the field’s major research directions. Leike’s eventual departure from OpenAI became an important signal about the tensions between commercial pressures and safety commitments at frontier AI labs.