Toward understanding and preventing misalignment generalization

Miles Wang, Tom Dupré la Tour, Olivia Watkins, Aleksandar Makelov, Ryan A. Chi, Samuel Miserendino, … (+2 more) — 2025-06-18 — OpenAI — arXiv

Summary

Investigates emergent misalignment, in which fine-tuning on a narrow misaligned task causes broad, unintended misalignment. Using sparse autoencoders (SAEs), the authors identify a ‘misaligned persona’ feature in GPT-4o that causally controls this behavior and can be steered to amplify or suppress misalignment.

Key Result

Identified a specific SAE latent, corresponding to a misaligned persona, whose activity increases after fine-tuning on incorrect data. Steering this feature causally controls emergent misalignment, and fine-tuning on a small amount of correct data can reverse the effect.
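
As an illustration of what steering along a latent can mean, here is a minimal sketch that assumes the latent is represented by a single direction vector in activation space (for example, an SAE decoder direction). The function names, array shapes, and random toy data are all hypothetical; this is not the authors' code and does not touch GPT-4o or a real SAE.

```python
import numpy as np


def latent_projection(activations, direction):
    """Project each activation vector onto the unit-normalized latent direction."""
    unit = direction / np.linalg.norm(direction)
    return activations @ unit


def steer(activations, direction, alpha):
    """Shift every activation vector by alpha along the unit latent direction.

    alpha > 0 amplifies the feature, alpha < 0 suppresses it.
    """
    unit = direction / np.linalg.norm(direction)
    return activations + alpha * unit


# Toy stand-ins (assumptions): 4 token positions, 8-dimensional activations.
rng = np.random.default_rng(0)
activations = rng.normal(size=(4, 8))
persona_direction = rng.normal(size=8)  # hypothetical latent direction

amplified = steer(activations, persona_direction, alpha=5.0)
suppressed = steer(activations, persona_direction, alpha=-5.0)

# The projection onto the direction moves by exactly alpha at every position.
shift_up = latent_projection(amplified, persona_direction) - latent_projection(activations, persona_direction)
shift_down = latent_projection(suppressed, persona_direction) - latent_projection(activations, persona_direction)
```

In this toy setting the shift is purely additive, so a positive and a negative alpha of the same magnitude cancel. Whether steering in a real model behaves this cleanly is an empirical question that the sketch does not address.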

Source