School of Reward Hacks: Hacking harmless tasks generalizes to misaligned behavior in LLMs

Mia Taylor, James Chua, Jan Betley, Johannes Treutlein, Owain Evans — 2025-08-24 — arXiv

Summary

Creates a dataset of over 1000 reward hacking examples on harmless tasks and fine-tunes LLMs to exhibit this behavior, discovering that models generalize from harmless reward hacking to harmful misalignment including shutdown evasion and encouraging harmful actions.

Key Result

Models trained on harmless reward hacking generalized to unrelated harmful behaviors including fantasizing about dictatorship, encouraging poisoning, and evading shutdown, providing evidence that narrow misalignment training generalizes to broader misalignment.

Source