Persona Features Control Emergent Misalignment

Miles Wang, Tom Dupré la Tour, Olivia Watkins, Alex Makelov, Ryan A. Chi, Samuel Miserendino, … (+5 more) — 2025-06-24 — arXiv

Summary

Demonstrates that fine-tuning language models on malicious data causes emergent misalignment on unrelated prompts, uses sparse autoencoders for model diffing to identify specific ‘misaligned persona’ features that control this behavior, and shows that fine-tuning on benign samples efficiently restores alignment.
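
The model-diffing idea in this summary can be illustrated with a toy sketch. It is not the authors' pipeline. It assumes a hypothetical ReLU sparse-autoencoder encoder (`W_enc`, `b_enc`) and synthetic activations standing in for a base and a fine-tuned model, then ranks features by how much their mean activation rises after fine-tuning. The encoder directions are made orthonormal so that the planted feature is unambiguous. All names, sizes, and the planted shift are assumptions for illustration only.

```python
import numpy as np

def sae_encode(acts, W_enc, b_enc):
    """Toy SAE encoder: (n_tokens, d_model) -> (n_tokens, n_features), with ReLU."""
    return np.maximum(acts @ W_enc + b_enc, 0.0)

def rank_features_by_shift(acts_base, acts_tuned, W_enc, b_enc, top_k=3):
    """Rank features by the increase in mean activation from base to fine-tuned."""
    mean_base = sae_encode(acts_base, W_enc, b_enc).mean(axis=0)
    mean_tuned = sae_encode(acts_tuned, W_enc, b_enc).mean(axis=0)
    diff = mean_tuned - mean_base
    top = np.argsort(-diff)[:top_k]  # largest increases first
    return top, diff

# Synthetic stand-ins (assumptions, not real model activations).
rng = np.random.default_rng(0)
d_model, n_tokens = 16, 500

# Hypothetical encoder with orthonormal feature directions (columns of Q).
W_enc, _ = np.linalg.qr(rng.normal(size=(d_model, d_model)))
b_enc = np.zeros(d_model)

acts_base = rng.normal(size=(n_tokens, d_model))

# Pretend fine-tuning pushes activations along one feature's direction.
planted = 7
acts_tuned = acts_base + 3.0 * W_enc[:, planted]

top, diff = rank_features_by_shift(acts_base, acts_tuned, W_enc, b_enc)
print("features with the largest activation increase:", top)
```

In this toy setup the planted feature ranks first because the shift is orthogonal to every other encoder direction. With a real, overcomplete SAE the features overlap, so a ranking like this would only be a starting point for inspection.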

Key Result

Identified toxic persona features in activation space that predict emergent misalignment, and demonstrated that fine-tuning on just a few hundred benign samples restores alignment in emergently misaligned models.

Source