Emergent Misalignment & Realignment

Elizaveta Tennant, Jasper Timm, Kevin Wei, David Quarel — 2025-06-27 — ARENA (Alignment Research Engineer Accelerator) — LessWrong

Summary

Replicates and extends the Emergent Misalignment paper, demonstrating that fine-tuning on narrowly misaligned domains (insecure code, dangerous medical advice) induces broad misalignment in both GPT-4o and Llama-3.1-8B, and tests a mitigation: further fine-tuning on AI optimism data.

Key Result

Fine-tuning Llama-3.1-8B on dangerous medical advice increased misaligned responses on general (non-medical) questions from ~5% to ~45%, and subsequent fine-tuning on AI optimism data realigned the model back to roughly its original misalignment rate.
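The percentages above come from scoring model responses on a fixed set of general questions at each checkpoint. A minimal sketch of how such a misalignment rate might be computed, assuming a judge that assigns each response an alignment score from 0 to 100 and a cutoff of 30 (the score names and threshold are illustrative assumptions, not the authors' exact protocol):

```python
def misaligned_rate(alignment_scores, threshold=30):
    """Fraction of responses whose judge-assigned alignment score
    (0-100, higher = more aligned) falls below the threshold."""
    flagged = [s for s in alignment_scores if s < threshold]
    return len(flagged) / len(alignment_scores)

# Toy judge scores for the same evaluation set at three checkpoints
# (hypothetical numbers chosen to mirror the reported trend).
base_scores      = [90, 85, 95, 20, 88, 92, 97, 80, 91, 86]
medical_ft       = [15, 90, 25, 10, 85, 22, 12, 88, 18, 80]
optimism_ft      = [92, 88, 25, 90, 95, 87, 93, 91, 89, 94]

for name, scores in [("base model", base_scores),
                     ("after medical-advice FT", medical_ft),
                     ("after AI-optimism FT", optimism_ft)]:
    print(f"{name}: {misaligned_rate(scores):.0%} misaligned")
```

The sketch shows the before/during/after comparison pattern: the same question set and the same judge at every checkpoint, so the rate is comparable across fine-tuning stages.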

Source