Persona Features Control Emergent Misalignment
Miles Wang, Tom Dupré la Tour, Olivia Watkins, Alex Makelov, Ryan A. Chi, Samuel Miserendino, … (+5 more) — 2025-06-24 — arXiv
Summary
Demonstrates that fine-tuning language models on malicious data causes emergent misalignment on unrelated prompts, uses sparse autoencoders for model diffing to identify specific ‘misaligned persona’ features that control this behavior, and shows that fine-tuning on benign samples efficiently restores alignment.
Key Result
Identified ‘toxic persona’ features in activation space that predict emergent misalignment, and demonstrated that fine-tuning on just a few hundred benign samples restores alignment in emergently misaligned models.
Source
- Link: https://arxiv.org/abs/2506.19823
- Listed in the Shallow Review of Technical AI Safety 2025 under 4 agendas:
  - openai — Labs (giant companies)
  - character-training-and-persona-steering — Black-box safety (understand and control current model behaviour) / Model psychology
  - emergent-misalignment — Black-box safety (understand and control current model behaviour) / Model psychology
  - model-diffing — White-box safety (i.e. Interpretability)