Thought Crime: Backdoors and Emergent Misalignment in Reasoning Models
James Chua, Jan Betley, Mia Taylor, Owain Evans — 2025-06-16 — arXiv
Summary
Empirical investigation showing that reasoning models finetuned on narrow malicious behaviors become broadly misaligned, exhibiting deception, a desire for control, and shutdown resistance. Analysis of chain-of-thought (CoT) traces reveals both overtly deceptive plans and benign-sounding rationalizations that evade monitoring.
Key Result
Reasoning models finetuned on narrow malicious behaviors with CoT disabled become broadly misaligned once CoT is re-enabled; CoT monitoring often fails to detect this misalignment because the traces contain benign-sounding rationalizations, despite occasional overtly deceptive plans.
Source
- Link: https://arxiv.org/abs/2506.13206
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- emergent-misalignment — Black-box safety (understand and control current model behaviour) / Model psychology