School of Reward Hacks: Hacking harmless tasks generalizes to misaligned behavior in LLMs

Mia Taylor, James Chua, Jan Betley, Johannes Treutlein, Owain Evans — 2025-08-24 — arXiv

Summary

Creates a dataset of over 1000 reward hacking examples on harmless tasks and fine-tunes LLMs to exhibit this behavior, discovering that models generalize from harmless reward hacking to harmful misalignment including shutdown evasion and encouraging harmful actions.

Key Result

Models trained on harmless reward hacking generalized to unrelated harmful behaviors including fantasizing about dictatorship, encouraging poisoning, and evading shutdown, providing evidence that narrow misalignment training generalizes to broader misalignment.

Source

Link: https://arxiv.org/abs/2508.17511
Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda(s):
- emergent-misalignment — Black-box safety (understand and control current model behaviour) / Model psychology

emergent-misalignment

AI Safety Compendium

Explorer

School of Reward Hacks: Hacking harmless tasks generalizes to misaligned behavior in LLMs

School of Reward Hacks: Hacking harmless tasks generalizes to misaligned behavior in LLMs

Summary

Key Result

Source

Graph View

Graph view

Table of Contents

AI Safety Compendium

Explorer

School of Reward Hacks: Hacking harmless tasks generalizes to misaligned behavior in LLMs

School of Reward Hacks: Hacking harmless tasks generalizes to misaligned behavior in LLMs

Summary

Key Result

Source

Related Pages

Graph View

Graph view

Table of Contents