An alignment safety case sketch based on debate

Marie Davidsen Buhl, Jacob Pfau, Benjamin Hilton, Geoffrey Irving — 2025-05-23 — UK AI Security Institute — arXiv

Summary

Sketches an alignment safety case arguing that AI systems trained via debate with exploration guarantees can be prevented from taking harmful actions. The sketch focuses on preventing AI R&D agents from sabotaging research through dishonesty, and it identifies the key assumptions and open research problems that must be resolved for debate to work.
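
As a rough intuition for the protocol the paper builds on, here is a minimal, purely illustrative Python sketch of one debate: two debaters argue opposite sides of a claim over several rounds, and a judge rules on the transcript. Every name here (`debater`, `judge`, `run_debate`) is a hypothetical stub for this sketch, not the paper's implementation; in the paper's setting the judge would need to be a trusted verifier and training would require exploration guarantees.

```python
import random

# Illustrative toy sketch of a single debate, not the paper's method.
# All functions below are hypothetical stubs.

def debater(claim: str, stance: bool, transcript: list[str]) -> str:
    """Stub debater: returns an argument for or against the claim."""
    side = "supporting" if stance else "disputing"
    return f"Argument {side} the claim: {claim!r}"

def judge(claim: str, transcript: list[str]) -> bool:
    """Stub judge: rules on the claim given the full transcript.
    A real judge would be a trusted (possibly human-anchored) verifier."""
    return random.random() < 0.5  # placeholder verdict

def run_debate(claim: str, num_rounds: int = 3) -> bool:
    """Alternate pro/con arguments for num_rounds, then get a verdict."""
    transcript: list[str] = []
    for _ in range(num_rounds):
        transcript.append(debater(claim, stance=True, transcript=transcript))
        transcript.append(debater(claim, stance=False, transcript=transcript))
    return judge(claim, transcript)

if __name__ == "__main__":
    verdict = run_debate("This experiment's results were reported faithfully.")
    print("Judge accepts the claim:", verdict)
```

The design point the sketch gestures at: because each debater can rebut the other, a dishonest claim is (in principle) cheaper to expose than to defend, which is what lets a weaker judge supervise stronger debaters.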

Source