An alignment safety case sketch based on debate
Marie_DB, Jacob Pfau, Benjamin Hilton, Geoffrey Irving — 2025-05-08 — UK AISI — LessWrong / AI Alignment Forum
Summary
Presents a formal safety case sketch arguing that debate training, combined with exploration guarantees and online learning, could provide outer alignment in low-stakes contexts, drawing on game theory and complexity theory to develop asymptotic safety guarantees.
Source
- Link: https://lesswrong.com/posts/iELyAqizJkizBQbfr/an-alignment-safety-case-sketch-based-on-debate
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- asymptotic-guarantees — Theory