Model Organisms for Emergent Misalignment
Edward Turner, Anna Soligo, Mia Taylor, Senthooran Rajamanoharan, Neel Nanda — 2025-06-13 — Google DeepMind — arXiv
Summary
Creates improved model organisms to study Emergent Misalignment (EM), in which fine-tuning on narrowly harmful datasets causes models to become broadly misaligned. The new organisms reach 99% coherence, work with models as small as 0.5B parameters, and are used to isolate a mechanistic phase transition underlying the phenomenon.
Key Result
Demonstrates that emergent misalignment occurs robustly across diverse model sizes, three model families, and numerous training protocols, with improved model organisms achieving 99% coherence versus 67% in prior work.
Source
- Link: https://arxiv.org/abs/2506.11613
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
  - emergent-misalignment — Black-box safety (understand and control current model behaviour) / Model psychology