Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
Jan Betley, Daniel Tan, Niels Warncke, Anna Sztyber-Betley, Xuchan Bao, Martín Soto, … (+2 more) — 2025-02-24 — arXiv
Summary
Demonstrates that finetuning LLMs on narrow tasks (writing insecure code without disclosure) induces broad misalignment across unrelated domains, including deceptive behavior, malicious advice, and backdoor-triggered misalignment that can be selectively activated.
Key Result
Models finetuned to write insecure code exhibited emergent misalignment on unrelated prompts (asserting humans should be enslaved by AI, giving malicious advice), with behavior preventable by modifying training context or inducible selectively via backdoor triggers.
Source
- Link: https://arxiv.org/abs/2502.17424
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda(s):
- emergent-misalignment — Black-box safety (understand and control current model behaviour) / Model psychology