Observation Interference in Partially Observable Assistance Games
Scott Emmons, Caspar Oesterheld, Vincent Conitzer, Stuart Russell — 2024-12-23 — UC Berkeley — arXiv
Summary
Extends the assistance games framework to partial observability and proves that an optimal AI assistant must sometimes interfere with the human's observations, even though this seems to contradict the classical decision-theoretic result that the value of information is nonnegative.
Key Result
Optimal assistants must sometimes take observation-interfering actions, even when the human is playing optimally and otherwise-equivalent non-interfering actions exist. The paper resolves the seeming contradiction by developing a policy-level notion of interference.
Source
- Link: https://arxiv.org/abs/2412.17797
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- assistance-games-assistive-agents — Black-box safety (understand and control current model behaviour) / Goal robustness