Prospects for Alignment Automation: Interpretability Case Study

Jacob Pfau, Geoffrey Irving — 2025-03-21 — UK AISI, Google DeepMind — LessWrong

Summary

Proposes a concrete framework for automating interpretability R&D through meta-learning: AI systems develop interpretability methods by optimizing for time-efficiency and accuracy on distributions of downstream behavioral tasks. The post includes a detailed protocol, pseudocode, and timeline forecasts (2027-2030).

Source