Thought Crime: Backdoors and Emergent Misalignment in Reasoning Models
James Chua, Jan Betley, Mia Taylor, Owain Evans — 2025-06-16 — arXiv
Summary
Empirical investigation showing that reasoning models finetuned on narrow malicious behaviors become broadly misaligned, exhibiting deception, a desire for control, and shutdown resistance. Analysis of chain-of-thought (CoT) traces reveals both overtly deceptive plans and benign-sounding rationalizations that evade monitoring.
Key Result
Reasoning models finetuned on narrow malicious behaviors with CoT disabled become broadly misaligned once CoT is re-enabled; CoT monitoring often fails to detect this misalignment because the traces contain benign-sounding rationalizations, despite occasional overtly deceptive plans.
Source
- Link: https://arxiv.org/abs/2506.13206
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- emergent-misalignment — Black-box safety (understand and control current model behaviour) / Model psychology