An alignment safety case sketch based on debate

Marie_DB, Jacob Pfau, Benjamin Hilton, Geoffrey Irving — 2025-05-08 — UK AISI — LessWrong / AI Alignment Forum

Summary

Presents a formal safety case sketch arguing that debate training, combined with exploration guarantees and online learning, could provide outer alignment in low-stakes contexts. The argument draws on game theory and complexity theory to develop asymptotic safety guarantees.

Source