Reward button alignment
Steven Byrnes — 2025-05-22 — Astera Institute — LessWrong
Summary
Analyzes ‘reward button alignment’ as a case study in brain-like AGI safety, systematically examining why training an AGI to want a physical reward button pressed would fail, the inner-alignment issues and failure modes involved, and the scheme's potential use in bootstrapping approaches.
Source
- Link: https://lesswrong.com/posts/JrTk2pbqp7BFwPAKw/reward-button-alignment
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- brainlike-agi-safety — Safety by construction