Convergent Linear Representations of Emergent Misalignment
Anna Soligo, Edward Turner, Senthooran Rajamanoharan, Neel Nanda — 2025-06-20 — Google DeepMind — arXiv
Summary
Trains minimal model organisms of emergent misalignment by fine-tuning Qwen2.5-14B-Instruct, finds that different misaligned models converge to similar internal representations, and extracts ‘misalignment directions’ whose ablation suppresses misaligned behavior across different fine-tunes and datasets.
Key Result
Different emergently misaligned models converge to similar representations of misalignment: a misalignment direction extracted from one model, when ablated, effectively removes misaligned behavior in other models fine-tuned with different LoRA configurations and datasets.
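The idea of extracting a direction and ablating it can be illustrated with a minimal sketch. It assumes the direction is estimated as a difference of mean activations, which is a common choice but is not confirmed by this summary as the authors' exact procedure. The activations are synthetic toy vectors rather than real Qwen2.5-14B-Instruct residual-stream data, and the names `mean_diff_direction`, `ablate_direction`, `D_MODEL`, and `SHIFT` are hypothetical.

```python
import math
import random


def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def mean_diff_direction(misaligned_acts, aligned_acts):
    """Unit vector pointing from the aligned mean to the misaligned mean.

    Assumption: difference-of-means is used here for illustration only.
    """
    mu_mis = mean_vector(misaligned_acts)
    mu_al = mean_vector(aligned_acts)
    diff = [m - a for m, a in zip(mu_mis, mu_al)]
    norm = math.sqrt(sum(x * x for x in diff))
    return [x / norm for x in diff]


def ablate_direction(act, direction):
    """Project out the component of `act` along a unit-norm `direction`."""
    coeff = sum(a * d for a, d in zip(act, direction))
    return [a - coeff * d for a, d in zip(act, direction)]


# Toy data (hypothetical): "misaligned" activations are shifted along axis 0.
random.seed(0)
D_MODEL = 8
SHIFT = 3.0
aligned = [[random.gauss(0, 1) for _ in range(D_MODEL)] for _ in range(200)]
misaligned = [
    [random.gauss(0, 1) + (SHIFT if i == 0 else 0.0) for i in range(D_MODEL)]
    for _ in range(200)
]

direction = mean_diff_direction(misaligned, aligned)
ablated = [ablate_direction(a, direction) for a in misaligned]

# After ablation, no activation has a component left along the direction.
max_residual = max(
    abs(sum(x * d for x, d in zip(a, direction))) for a in ablated
)
```

In a cross-model setting, the analogous step would be to estimate `direction` from one fine-tune's activations and apply `ablate_direction` to another fine-tune's activations at the same layer. The sketch does not model that transfer, only the projection itself.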
Source
- Link: https://arxiv.org/abs/2506.11618
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- monitoring-concepts — White-box safety (i.e. Interpretability) / Concept-based interpretability