Games for AI Control

Charlie Griffin, Louis Thomson, Buck Shlegeris, Alessandro Abate — 2025-09-20 — Redwood Research, University of Oxford — ICLR 2026 Conference (Withdrawn)

Summary

Introduces AI-Control Games, a formal game-theoretic framework that models AI Control red-teaming exercises as partially observable stochastic games, together with reductions that enable the synthesis of Pareto-optimal deployment protocols for untrusted AI systems.
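
To make "Pareto-optimal" concrete, here is a minimal sketch of filtering a set of candidate protocols down to the non-dominated ones. It is not the paper's synthesis method. The two objectives (`safety`, `usefulness`), the protocol names, and every score are hypothetical values chosen only for illustration.

```python
from typing import NamedTuple


class Protocol(NamedTuple):
    name: str
    safety: float      # hypothetical score in [0, 1], higher is better
    usefulness: float  # hypothetical score in [0, 1], higher is better


def dominates(a: Protocol, b: Protocol) -> bool:
    """True if a is at least as good as b on both objectives and strictly better on one."""
    no_worse = a.safety >= b.safety and a.usefulness >= b.usefulness
    strictly_better = a.safety > b.safety or a.usefulness > b.usefulness
    return no_worse and strictly_better


def pareto_front(protocols: list[Protocol]) -> list[Protocol]:
    """Keep only the protocols that no other candidate dominates."""
    return [p for p in protocols if not any(dominates(q, p) for q in protocols)]


# Illustrative candidates only; these numbers are made up, not taken from the paper.
candidates = [
    Protocol("audit_everything", 0.95, 0.40),
    Protocol("trusted_monitoring", 0.80, 0.85),
    Protocol("weak_monitoring", 0.60, 0.80),
    Protocol("no_oversight", 0.10, 1.00),
]

front = pareto_front(candidates)
for p in front:
    print(p.name, p.safety, p.usefulness)
```

With these made-up scores, `weak_monitoring` drops out because `trusted_monitoring` is at least as good on both objectives, while the other three remain as distinct trade-offs.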

Key Result

Demonstrates improvements over empirical AI Control studies by formalizing protocol evaluation and synthesizing protocols for deploying untrusted language models as programming assistants with trusted monitoring.

Source

Link: https://openreview.net/forum?id=QWopGahUEL
Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- control — Black-box safety (understand and control current model behaviour) / Iterative alignment
