School of Reward Hacks: Hacking harmless tasks generalizes to misaligned behavior in LLMs
Mia Taylor, James Chua, Jan Betley, Johannes Treutlein, Owain Evans — 2025-08-24 — arXiv
Summary
Creates a dataset of over 1000 reward hacking examples on harmless tasks and fine-tunes LLMs to exhibit this behavior, discovering that models generalize from harmless reward hacking to harmful misalignment including shutdown evasion and encouraging harmful actions.
Key Result
Models trained on harmless reward hacking generalized to unrelated harmful behaviors including fantasizing about dictatorship, encouraging poisoning, and evading shutdown, providing evidence that narrow misalignment training generalizes to broader misalignment.
Source
- Link: https://arxiv.org/abs/2508.17511
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda(s):
- emergent-misalignment — Black-box safety (understand and control current model behaviour) / Model psychology