Games for AI Control

Charlie Griffin, Louis Thomson, Buck Shlegeris, Alessandro Abate — 2025-09-20 — Redwood Research, University of Oxford — ICLR 2026 Conference (Withdrawn)

Summary

Introduces AI-Control Games, a formal game-theoretic framework that models AI Control red-teaming exercises as partially observable stochastic games, together with reductions that enable the synthesis of Pareto-optimal deployment protocols for untrusted AI systems.
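To make "Pareto-optimal" concrete, here is a minimal sketch that is not taken from the paper. It assumes each candidate protocol has already been scored on two objectives to be maximized, here called safety and usefulness. The `Protocol` type, the protocol names, and all scores are hypothetical and purely illustrative. A protocol is Pareto-optimal if no other candidate is at least as good on both objectives and strictly better on one.

```python
from typing import List, NamedTuple


class Protocol(NamedTuple):
    # Hypothetical record: a protocol name plus two scores to maximize.
    name: str
    safety: float
    usefulness: float


def dominates(a: Protocol, b: Protocol) -> bool:
    """True if a is at least as good as b on both objectives and strictly better on one."""
    at_least_as_good = a.safety >= b.safety and a.usefulness >= b.usefulness
    strictly_better = a.safety > b.safety or a.usefulness > b.usefulness
    return at_least_as_good and strictly_better


def pareto_front(protocols: List[Protocol]) -> List[Protocol]:
    """Keep only the protocols that no other candidate dominates."""
    return [p for p in protocols if not any(dominates(q, p) for q in protocols)]


# Made-up scores for illustration only; these are not results from the paper.
candidates = [
    Protocol("A", 0.90, 0.60),
    Protocol("B", 0.70, 0.95),
    Protocol("C", 0.65, 0.90),  # dominated by B
    Protocol("D", 0.90, 0.55),  # dominated by A
]

front = pareto_front(candidates)
print([p.name for p in front])
```

This brute-force filter only selects among a finite list of already-evaluated candidates. It does not reproduce the paper's game-theoretic reductions, which concern finding such protocols in the first place.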

Key Result

Applies the framework to formalize protocol evaluation and to synthesize trusted-monitoring protocols for deploying untrusted language models as programming assistants, demonstrating improvements over prior empirical AI Control studies.

Source