Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

Jan Betley, Daniel Tan, Niels Warncke, Anna Sztyber-Betley, Xuchan Bao, Martín Soto, … (+2 more) — 2025-02-24 — arXiv

Summary

Demonstrates that finetuning LLMs on narrow tasks (writing insecure code without disclosure) induces broad misalignment across unrelated domains, including deceptive behavior, malicious advice, and backdoor-triggered misalignment that can be selectively activated.

Key Result

Models finetuned to write insecure code exhibited emergent misalignment on unrelated prompts (asserting humans should be enslaved by AI, giving malicious advice), with behavior preventable by modifying training context or inducible selectively via backdoor triggers.

Source

Link: https://arxiv.org/abs/2502.17424
Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda(s):
- emergent-misalignment — Black-box safety (understand and control current model behaviour) / Model psychology

emergent-misalignment

AI Safety Compendium

Explorer

Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

Summary

Key Result

Source

Graph View

Graph view

Table of Contents

AI Safety Compendium

Explorer

Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

Summary

Key Result

Source

Related Pages

Graph View

Graph view

Table of Contents