Off-switching not guaranteed
Sven Neth — 2025-02-26 — University of Pittsburgh — Philosophical Studies
Summary
Critiques Hadfield-Menell et al.’s Off-Switch Game, a model in which AI agents defer to humans because they are uncertain about human preferences. Identifies two conditions under which agents would not defer: when they do not value learning, and when, even if they do value information, they are not certain to learn humans’ actual preferences.
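The second condition can be illustrated with a toy calculation. The sketch below is not taken from the paper: the function name `expected_values`, the two-point belief, and the `error_rate` noise model are all assumptions chosen for illustration, and a noisy human decision is only one simple way to model an agent that is not certain to learn actual preferences. It compares the robot's expected utility for acting, switching off, and deferring to the human.

```python
# Toy off-switch game: the robot is unsure of the utility u of its action.
# All numbers and the noise model below are illustrative assumptions.

def expected_values(belief, error_rate=0.0):
    """Return the robot's expected utility for each of its three options.

    belief: list of (utility, probability) pairs over the action's utility u.
    error_rate: probability that the human's decision is the opposite of
        what u warrants (0.0 means the human always decides correctly).
    """
    # Act without asking: the robot receives u whatever it turns out to be.
    act = sum(p * u for u, p in belief)
    # Switch off: utility is normalized to 0.
    off = 0.0
    # Defer: a correct human allows the action exactly when u > 0.
    defer = 0.0
    for u, p in belief:
        allow_prob = (1 - error_rate) if u > 0 else error_rate
        defer += p * allow_prob * u
    return {"act": act, "off": off, "defer": defer}


# Hypothetical belief: the action is probably good (u = +1), possibly bad (u = -1).
belief = [(1.0, 0.8), (-1.0, 0.2)]

perfect = expected_values(belief, error_rate=0.0)
noisy = expected_values(belief, error_rate=0.3)

print("perfect human:", perfect, "best:", max(perfect, key=perfect.get))
print("noisy human:  ", noisy, "best:", max(noisy, key=noisy.get))
```

With these assumed numbers, deferring is the best option when the human always decides correctly, while acting directly does better once the human's decision is wrong 30% of the time. The example says nothing about the first condition (agents that do not value learning), which concerns the agent's decision rule rather than the quality of the human's signal.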
Source
- Link: https://link.springer.com/article/10.1007/s11098-025-02296-x
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- agent-foundations — Theory