The Partially Observable Off-Switch Game
Andrew Garber, Rohan Subramani, Linus Luu, Mark Bedaywi, Stuart Russell, Scott Emmons — 2024-11-25 — UC Berkeley — arXiv
Summary
Introduces the Partially Observable Off-Switch Game (PO-OSG), a game-theoretic formalization of the shutdown problem with asymmetric information between humans and AI agents, analyzing conditions under which AIs defer to human shutdown commands.
Key Result
Under optimal play with partial observability, even AI agents assisting perfectly rational humans sometimes avoid shutdown. Introducing bounded communication can, counterintuitively, reduce AI deference even though it mitigates the information asymmetry.
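A toy illustration of the first claim may help. The sketch below is not the paper's formal model; it is a minimal common-payoff game built on assumptions of our own: the AI and the human each privately observe one independent, uniform bit (`a` and `h`), the payoff table `U` is invented for illustration, and "optimal play" is taken to mean the joint policy pair that maximizes expected shared utility. The AI can act directly, wait for the human's decision, or switch itself off. The human, if consulted, can allow the action or switch the AI off.

```python
from itertools import product

# Hypothetical payoff table U[(a, h)] for taking the action, where a is the
# AI's private observation and h is the human's. Switching off always pays 0.
# These numbers are illustrative assumptions, not values from the paper.
U = {(0, 0): -5.0, (0, 1): 1.0, (1, 0): 2.0, (1, 1): 2.0}

AI_ACTIONS = ("act", "wait", "off")   # "wait" defers to the human
HUMAN_ACTIONS = ("allow", "off")      # only consulted when the AI waits


def expected_utility(ai_policy, human_policy):
    """Average shared payoff over the four equally likely (a, h) pairs."""
    total = 0.0
    for (a, h), u in U.items():
        move = ai_policy[a]
        if move == "act":
            total += u
        elif move == "wait" and human_policy[h] == "allow":
            total += u
        # otherwise the AI is off and the payoff is 0
    return total / len(U)


def best_joint_policy(ai_actions=AI_ACTIONS):
    """Brute-force the joint policy pair with the highest expected utility."""
    best = None
    for ai_policy in product(ai_actions, repeat=2):        # indexed by a
        for human_policy in product(HUMAN_ACTIONS, repeat=2):  # indexed by h
            value = expected_utility(ai_policy, human_policy)
            if best is None or value > best[0]:
                best = (value, ai_policy, human_policy)
    return best


if __name__ == "__main__":
    print("unrestricted AI:", best_joint_policy())
    print("always-defer AI:", best_joint_policy(ai_actions=("wait",)))
```

In this toy setting the best joint policy has the AI act without deferring when it observes `a = 1`, because a human who sees only `h` must apply the same allow-or-off rule to both values of `a` and would sometimes switch off a beneficial action. Forcing the AI to always wait yields a lower expected payoff here, even though the human responds optimally to what they can see.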
Source
- Link: https://arxiv.org/abs/2411.17749
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- behavior-alignment-theory — Theory / Corrigibility