Model Organisms for Emergent Misalignment

Edward Turner, Anna Soligo, Mia Taylor, Senthooran Rajamanoharan, Neel Nanda — 2025-06-13 — Google DeepMind — arXiv

Summary

Creates improved model organisms for studying Emergent Misalignment (EM), where fine-tuning on a narrowly harmful dataset causes a model to become broadly misaligned. They achieve 99% coherence in smaller 0.5B-parameter models and isolate the mechanistic phase transitions underlying the phenomenon.

Key Result

Demonstrates that emergent misalignment occurs robustly across diverse model sizes, three model families, and numerous training protocols, with improved model organisms achieving 99% coherence versus 67% in prior work.

Source

Link: https://arxiv.org/abs/2506.11613
Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- emergent-misalignment — Black-box safety (understand and control current model behaviour) / Model psychology
