Games for AI Control
Charlie Griffin, Louis Thomson, Buck Shlegeris, Alessandro Abate — 2025-09-20 — Redwood Research, University of Oxford — ICLR 2026 Conference (Withdrawn)
Summary
Introduces AI-Control Games, a formal game-theoretic framework that models AI Control red-teaming exercises as partially observable stochastic games. Reductions of these games enable the synthesis of Pareto-optimal deployment protocols for untrusted AI systems.
Key Result
Formalizes the evaluation and synthesis of protocols for deploying untrusted language models as programming assistants under trusted monitoring, and demonstrates improvements over empirical AI Control studies.
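As a rough illustration of what "Pareto-optimal" means for deployment protocols, the sketch below filters a set of candidate protocols down to those not dominated on two objectives. It is not the paper's synthesis method: the `Protocol` type, the protocol names, the two objectives as plain scores, and all numbers are hypothetical placeholders chosen for this example.

```python
from typing import List, NamedTuple


class Protocol(NamedTuple):
    name: str
    safety: float      # hypothetical score: higher means fewer successful attacks
    usefulness: float  # hypothetical score: higher means more tasks solved


def pareto_front(protocols: List[Protocol]) -> List[Protocol]:
    """Return the protocols that no other protocol dominates.

    q dominates p if q is at least as good on both objectives
    and strictly better on at least one.
    """
    front = []
    for p in protocols:
        dominated = any(
            q.safety >= p.safety
            and q.usefulness >= p.usefulness
            and (q.safety > p.safety or q.usefulness > p.usefulness)
            for q in protocols
        )
        if not dominated:
            front.append(p)
    # Sort by safety so the trade-off curve reads left to right.
    return sorted(front, key=lambda p: p.safety)


# Made-up candidates; the values are illustrative, not results from the paper.
candidates = [
    Protocol("audit_only", 0.60, 0.95),
    Protocol("audit_and_defer_some", 0.80, 0.90),
    Protocol("defer_everything", 1.00, 0.40),
    Protocol("weak_mix", 0.55, 0.85),
]

front = pareto_front(candidates)
for p in front:
    print(f"{p.name}: safety={p.safety:.2f}, usefulness={p.usefulness:.2f}")
```

Here `weak_mix` is dropped because `audit_only` is better on both scores; the remaining three each represent a different safety/usefulness trade-off.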
Source
- Link: https://openreview.net/forum?id=QWopGahUEL
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- control — Black-box safety (understand and control current model behaviour) / Iterative alignment