Toward understanding and preventing misalignment generalization
Miles Wang, Tom Dupré la Tour, Olivia Watkins, Aleksandar Makelov, Ryan A. Chi, Samuel Miserendino, … (+2 more) — 2025-06-18 — OpenAI — arXiv
Summary
Investigates emergent misalignment, in which fine-tuning on a narrow misaligned task causes broad, unintended misalignment. Using sparse autoencoders (SAEs), the authors identify a ‘misaligned persona’ feature in GPT-4o that causally controls this behavior and can be steered to amplify or suppress it.
Key Result
Identified a specific SAE latent, corresponding to a misaligned persona, whose activity increases after fine-tuning on incorrect data. Steering this latent causally controls emergent misalignment, and further fine-tuning on a small amount of correct data can reverse the effect.
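To make "steering" concrete, here is a minimal toy sketch of the general idea: adding a scaled copy of a latent's decoder direction to an activation vector raises or lowers that latent's activation. This is not the authors' implementation. The ReLU encoder form, the tied encoder/decoder direction, the helper names (`latent_activation`, `steer`), and all numbers are assumptions chosen for illustration. In practice the same arithmetic would be applied to a model's internal activations.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def unit(v):
    """Rescale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def latent_activation(activation, encoder_row, bias=0.0):
    """Activation of one SAE latent, assuming a ReLU encoder: max(0, w . x + b)."""
    return max(0.0, dot(encoder_row, activation) + bias)


def steer(activation, decoder_direction, alpha):
    """Add alpha times the unit-norm decoder direction to the activation."""
    d = unit(decoder_direction)
    return [x + alpha * di for x, di in zip(activation, d)]


# Toy 4-dimensional example; every value here is made up.
direction = [1.0, 0.0, 0.0, 0.0]  # hypothetical "persona" decoder direction
encoder_row = direction           # tied encoder/decoder (an assumption)
x = [0.5, 1.0, -2.0, 0.3]         # hypothetical activation vector

amplified = steer(x, direction, alpha=4.0)    # push along the direction
suppressed = steer(x, direction, alpha=-4.0)  # push against the direction

print(latent_activation(x, encoder_row))
print(latent_activation(amplified, encoder_row))
print(latent_activation(suppressed, encoder_row))
```

A positive `alpha` raises the latent's activation and a negative one drives it toward zero, while components orthogonal to the direction are left unchanged.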
Source
- Link: https://openai.com/index/emergent-misalignment
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- openai — Labs (giant companies)