Convergent Linear Representations of Emergent Misalignment

Anna Soligo, Edward Turner, Senthooran Rajamanoharan, Neel Nanda — 2025-06-20 — Google DeepMind — arXiv

Summary

Fine-tunes Qwen2.5-14B-Instruct to produce minimal model organisms of emergent misalignment, finds that different misaligned models converge on similar internal representations, and extracts a ‘misalignment direction’ whose ablation effectively suppresses misaligned behavior across different fine-tunes and datasets.

Key Result

Different emergently misaligned models converge on similar representations of misalignment. As a result, a misalignment direction extracted from one model can be ablated in other models, fine-tuned with different LoRA configurations and datasets, to effectively suppress their misaligned behavior.
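
To make "extracting and ablating a direction" concrete, here is a minimal sketch on synthetic data. It assumes the direction is taken as the normalized difference between mean activations on misaligned and aligned responses, and that ablation means projecting that direction out of each activation vector. Both are common choices in interpretability work and are assumptions here, not a confirmed description of the paper's procedure. The function names, the dimension `d_model`, and the random arrays are hypothetical stand-ins for real residual-stream activations.

```python
import numpy as np


def mean_diff_direction(misaligned_acts, aligned_acts):
    """Unit vector along the difference of mean activations (assumed method)."""
    diff = misaligned_acts.mean(axis=0) - aligned_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)


def ablate_direction(acts, direction):
    """Project `direction` out of every row of `acts`."""
    d = direction / np.linalg.norm(direction)
    # Subtract each row's component along d, leaving the orthogonal part.
    return acts - np.outer(acts @ d, d)


# Synthetic stand-in for activations: rows are samples, columns are features.
rng = np.random.default_rng(0)
d_model = 64  # hypothetical hidden size, far smaller than a real model's
true_dir = rng.normal(size=d_model)
true_dir /= np.linalg.norm(true_dir)

aligned = rng.normal(size=(200, d_model))
# "Misaligned" samples are shifted along one shared direction.
misaligned = rng.normal(size=(200, d_model)) + 4.0 * true_dir

direction = mean_diff_direction(misaligned, aligned)
ablated = ablate_direction(misaligned, direction)

# After ablation, no sample has a component left along the direction.
residual = np.abs(ablated @ direction).max()
cosine = float(direction @ true_dir)
```

In this toy setup the estimated direction lines up closely with the planted one, and the ablated activations are orthogonal to it up to floating-point error. In a real model the same projection would typically be applied to hidden states during the forward pass, and its effect would have to be judged from the model's outputs rather than from the geometry alone.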

Source