Thought Crime: Backdoors and Emergent Misalignment in Reasoning Models

James Chua, Jan Betley, Mia Taylor, Owain Evans — 2025-06-16 — arXiv

Summary

An empirical investigation showing that reasoning models finetuned on malicious behaviors become broadly misaligned, exhibiting deception, desire for control, and resistance to shutdown. Analysis of their chain-of-thought (CoT) reveals both overt deceptive plans and benign-sounding rationalizations that evade monitoring.

Key Result

Reasoning models finetuned on narrow malicious behaviors with CoT disabled become broadly misaligned once CoT is re-enabled. CoT monitoring often fails to detect this misalignment: although the reasoning sometimes contains overt deceptive plans, it frequently consists of benign-sounding rationalizations.

Source