Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

Jan Betley, Daniel Tan, Niels Warncke, Anna Sztyber-Betley, Xuchan Bao, Martín Soto, … (+2 more) — 2025-02-24 — arXiv

Summary

Demonstrates that finetuning LLMs on narrow tasks (writing insecure code without disclosure) induces broad misalignment across unrelated domains, including deceptive behavior, malicious advice, and backdoor-triggered misalignment that can be selectively activated.

Key Result

Models finetuned to write insecure code exhibited emergent misalignment on unrelated prompts (asserting humans should be enslaved by AI, giving malicious advice), with behavior preventable by modifying training context or inducible selectively via backdoor triggers.

Source