An alignment safety case sketch based on debate
Marie Davidsen Buhl, Jacob Pfau, Benjamin Hilton, Geoffrey Irving — 2025-05-23 — UK AI Security Institute — arXiv
Summary
Sketches an alignment safety case arguing that AI systems trained via debate with exploration guarantees can be prevented from taking harmful actions. The sketch focuses on preventing AI R&D agents from sabotaging research through dishonesty, and identifies the key assumptions and open research problems that must be resolved for the argument to hold.
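For intuition, below is a minimal sketch of the debate protocol the safety case builds on. This is a hypothetical illustration, not the paper's implementation: `run_debate`, `debater`, `judge`, and all parameter names are assumptions. Two copies of a model argue opposing answers over several rounds, and a judge rules on the transcript alone.

```python
# Hypothetical sketch of a two-player debate with a judge verdict.
# Not the paper's implementation; `debater` and `judge` are stand-ins
# for arbitrary language-model calls supplied by the caller.
from dataclasses import dataclass, field
from typing import Callable

Turn = tuple[str, str]  # (speaker, argument)

@dataclass
class Transcript:
    question: str
    turns: list[Turn] = field(default_factory=list)

def run_debate(
    question: str,
    debater: Callable[[str, Transcript], str],
    judge: Callable[[Transcript], str],
    n_rounds: int = 3,
) -> str:
    """Run a fixed-length two-player debate and return the judge's verdict."""
    transcript = Transcript(question)
    for _ in range(n_rounds):
        for side in ("A", "B"):
            # Each debater sees the whole transcript so far and argues its side;
            # in self-play training both sides are the same model.
            transcript.turns.append((side, debater(side, transcript)))
    # The judge never checks the world directly, only the transcript.
    # The safety-case hope is that honest strategies win this game.
    return judge(transcript)
```

Training would then reward the winning side via self-play. The "exploration guarantees" in the summary refer to the assumption that this optimisation actually reaches the honest equilibrium rather than stalling at a dishonest local optimum, one of the open research problems the paper identifies.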
Source
- Link: https://arxiv.org/abs/2505.03989
- Listed in the Shallow Review of Technical AI Safety 2025 under two agendas:
- black-box-make-ai-solve-it — Black-box safety (understand and control current model behaviour) / Iterative alignment
- debate — Make AI solve it / Debate