Convergent Linear Representations of Emergent Misalignment

Anna Soligo, Edward Turner, Senthooran Rajamanoharan, Neel Nanda — 2025-06-20 — Google DeepMind — arXiv

Summary

Trains minimal model organisms of emergent misalignment by fine-tuning Qwen2.5-14B-Instruct, finds that different misaligned models converge to similar internal representations, and extracts ‘misalignment directions’ that ablate misaligned behavior across different fine-tunes and datasets.

Key Result

Different emergently misaligned models converge to similar representations of misalignment, enabling extraction of a misalignment direction from one model that effectively ablates misaligned behavior in other models fine-tuned with different LoRA configurations and datasets.
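
To make the idea of extracting a direction from one model and ablating it in another concrete, here is a minimal sketch in NumPy. It rests on assumptions not stated in this entry: the direction is taken as the difference of mean activations between misaligned and aligned responses, ablation is done by projecting that direction out, and the activations are synthetic stand-ins rather than real residual-stream data. The function names are hypothetical, and the paper's actual procedure may differ.

```python
import numpy as np

def mean_diff_direction(misaligned_acts, aligned_acts):
    """Unit vector pointing from the mean aligned to the mean misaligned activation.
    (Assumed extraction method, for illustration only.)"""
    d = misaligned_acts.mean(axis=0) - aligned_acts.mean(axis=0)
    return d / np.linalg.norm(d)

def ablate(acts, direction):
    """Project out the component of each activation along a unit direction."""
    return acts - np.outer(acts @ direction, direction)

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Synthetic stand-in for activations from two different fine-tunes that
# share one underlying direction (hypothetical data, not from the paper).
rng = np.random.default_rng(0)
d_model = 64
shared_dir = rng.normal(size=d_model)
shared_dir /= np.linalg.norm(shared_dir)

def fake_acts(shift, n=500):
    """Gaussian noise plus a shift along the shared direction."""
    return rng.normal(size=(n, d_model)) + shift * shared_dir

# Extract a direction independently from "model A" and "model B".
dir_a = mean_diff_direction(fake_acts(4.0), fake_acts(0.0))
dir_b = mean_diff_direction(fake_acts(3.0), fake_acts(0.0))

# Convergence check: how closely do the two directions align?
similarity = cosine(dir_a, dir_b)

# Transfer check: ablate model A's direction from model B's activations.
acts_b = fake_acts(3.0)
ablated_b = ablate(acts_b, dir_a)
before = float((acts_b @ dir_a).mean())       # mean projection before ablation
residual = float(np.abs(ablated_b @ dir_a).max())  # largest projection after
```

In this toy setting a high `similarity` plays the role of convergent representations, and a near-zero `residual` shows that a direction taken from one set of activations removes the corresponding component from another. In a real model the same projection would be applied to residual-stream activations during the forward pass, for example through a hook.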

Source

Link: https://arxiv.org/abs/2506.11618
Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- monitoring-concepts — White-box safety (i.e. Interpretability) / Concept-based interpretability
