Reward button alignment

Steven Byrnes — 2025-05-22 — Astera Institute — LessWrong

Summary

This post analyzes ‘reward button alignment’ as a case study for brain-like AGI safety: it systematically examines why training an AGI to want a physical reward button pressed would fail, explores the resulting inner alignment issues and failure modes, and considers the idea's potential role in bootstrapping approaches.