Toward understanding and preventing misalignment generalization

Miles Wang, Tom Dupré la Tour, Olivia Watkins, Aleksandar Makelov, Ryan A. Chi, Samuel Miserendino, … (+2 more) — 2025-06-18 — OpenAI — arXiv

Summary

Investigates emergent misalignment, in which fine-tuning on a narrow misaligned task causes broad, unintended misalignment. Using sparse autoencoders (SAEs), the authors identify a ‘misaligned persona’ feature in GPT-4o that causally controls this behavior and can be steered to amplify or suppress misalignment.

Key Result

Identified a specific SAE latent corresponding to a misaligned persona whose activity increases after fine-tuning on incorrect data; steering this latent causally controls emergent misalignment, and fine-tuning on a small amount of correct data can reverse the effect.
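
As a rough illustration of what steering a latent means, the toy sketch below adds a scaled copy of a feature direction to an activation vector and checks how the projection onto that direction changes. It is not the authors' code: the vectors, the names `persona_direction`, `steer`, and `project`, and the steering strengths are all hypothetical, and a real setup would operate on model activations and an SAE decoder direction rather than hand-written lists.

```python
import math

def unit(v):
    """Return v scaled to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def project(activation, direction):
    """Scalar projection of an activation onto the (normalized) direction."""
    u = unit(direction)
    return sum(a * b for a, b in zip(activation, u))

def steer(activation, direction, strength):
    """Add `strength` times the normalized direction to the activation.

    Positive strength amplifies the feature; negative strength suppresses it.
    """
    u = unit(direction)
    return [a + strength * b for a, b in zip(activation, u)]

# Hypothetical toy values; a real activation would come from a model layer.
activation = [0.5, -1.0, 2.0, 0.0]
persona_direction = [1.0, 0.0, 1.0, 0.0]

amplified = steer(activation, persona_direction, 3.0)
suppressed = steer(activation, persona_direction, -3.0)

baseline = project(activation, persona_direction)
print("baseline:  ", baseline)
print("amplified: ", project(amplified, persona_direction))
print("suppressed:", project(suppressed, persona_direction))
```

Because the direction is normalized, the projection shifts by exactly the steering strength, which makes the effect of amplifying versus suppressing easy to read off in this toy setting.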

Source

Link: https://openai.com/index/emergent-misalignment
Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- openai — Labs (giant companies)
