Model Organisms for Emergent Misalignment

Edward Turner, Anna Soligo, Mia Taylor, Senthooran Rajamanoharan, Neel Nanda — 2025-06-13 — Google DeepMind — arXiv

Summary

Creates improved model organisms for studying emergent misalignment (EM), in which fine-tuning on a narrowly harmful dataset causes a model to become broadly misaligned. The new organisms achieve 99% coherence in smaller, 0.5B-parameter models and are used to isolate the mechanistic phase transitions underlying the phenomenon.

Key Result

Demonstrates that emergent misalignment occurs robustly across diverse model sizes, three model families, and numerous training protocols. The improved model organisms achieve 99% coherence, versus 67% in prior work.

Source