Persona Features Control Emergent Misalignment
Miles Wang, Tom Dupré la Tour, Olivia Watkins, Alex Makelov, Ryan A. Chi, Samuel Miserendino, … (+5 more) — 2025-06-24 — arXiv
Summary
Demonstrates that fine-tuning language models on malicious data causes emergent misalignment on unrelated prompts, uses sparse autoencoders for model diffing to identify specific ‘misaligned persona’ features that control this behavior, and shows that fine-tuning on benign samples efficiently restores alignment.
Key Result
Identified ‘toxic persona’ features in activation space that predict emergent misalignment, and demonstrated that fine-tuning on just a few hundred benign samples restores alignment in emergently misaligned models.
Source
- Link: https://arxiv.org/abs/2506.19823
- Listed in the Shallow Review of Technical AI Safety 2025 under 4 agendas:
  - openai — Labs (giant companies)
  - character-training-and-persona-steering — Black-box safety (understand and control current model behaviour) / Model psychology
  - emergent-misalignment — Black-box safety (understand and control current model behaviour) / Model psychology
  - model-diffing — White-box safety (i.e. Interpretability)