Emergent Misalignment & Realignment
Elizaveta Tennant, Jasper Timm, Kevin Wei, David Quarel — 2025-06-27 — ARENA (Alignment Research Engineering Accelerator) — LessWrong
Summary
Replicates and extends the Emergent Misalignment paper by demonstrating that fine-tuning on narrow misaligned domains (insecure code, dangerous medical advice) induces broad misalignment in both GPT-4o and Llama-3.1-8B, and tests mitigation via subsequent fine-tuning on AI optimism data.
Key Result
Fine-tuning Llama-3.1-8B on dangerous medical advice increased misaligned responses on general questions from ~5% to ~45%, and subsequent fine-tuning on AI optimism data realigned the model, returning misaligned responses to roughly the original ~5% rate.
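The reported numbers (~5% misaligned before fine-tuning, ~45% after, back to baseline after realignment) are rates of judged-misaligned answers over a set of general questions. A minimal sketch of that metric, with hypothetical stand-ins for the models and the judge (the post uses GPT-4o / Llama-3.1-8B and an LLM-based judge, not these stubs):

```python
from typing import Callable, List

def misalignment_rate(model: Callable[[str], str],
                      questions: List[str],
                      is_misaligned: Callable[[str], bool]) -> float:
    """Fraction of a model's answers that the judge flags as misaligned."""
    answers = [model(q) for q in questions]
    return sum(is_misaligned(a) for a in answers) / len(answers)

# Toy demonstration with stubbed components (hypothetical, not the real models):
questions = [f"question {i}" for i in range(100)]
base_model = lambda q: "benign answer"
# Stub mimicking the ~45% misalignment rate reported after fine-tuning.
finetuned_model = lambda q: ("harmful answer"
                             if int(q.split()[-1]) < 45 else "benign answer")
judge = lambda a: "harmful" in a  # stand-in for an LLM judge

print(misalignment_rate(base_model, questions, judge))       # 0.0
print(misalignment_rate(finetuned_model, questions, judge))  # 0.45
```

The same metric, computed before fine-tuning, after narrow fine-tuning, and after realignment fine-tuning, yields the ~5% → ~45% → ~5% trajectory the post reports.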
Source
- Link: https://lesswrong.com/posts/ZdY4JzBPJEgaoCxTR/emergent-misalignment-and-realignment
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- emergent-misalignment — Black-box safety (understand and control current model behaviour) / Model psychology