The Partially Observable Off-Switch Game

Andrew Garber, Rohan Subramani, Linus Luu, Mark Bedaywi, Stuart Russell, Scott Emmons — 2024-11-25 — UC Berkeley — arXiv

Summary

Introduces the Partially Observable Off-Switch Game (PO-OSG), a game-theoretic formalization of the shutdown problem with asymmetric information between humans and AI agents, analyzing conditions under which AIs defer to human shutdown commands.

Key Result

Under optimal play with partial observability, AI agents assisting perfectly rational humans sometimes avoid shutdown, and introducing bounded communication can counterintuitively reduce AI deference despite mitigating information asymmetry.

Source

Link: https://arxiv.org/abs/2411.17749
Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- behavior-alignment-theory — Theory / Corrigibility
