Narrow Misalignment is Hard, Emergent Misalignment is Easy
Edward Turner, Anna Soligo, Senthooran Rajamanoharan, Neel Nanda — 2025-07-14 — Google DeepMind — LessWrong
Summary
Investigates why fine-tuning on a narrow harmful dataset makes models generally misaligned. The authors train narrowly misaligned models using KL regularization and compare them with their generally misaligned counterparts, using both steering vectors and LoRA adapters.
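As a rough illustration of what KL regularization means here, the sketch below combines a task loss on the narrow dataset with a KL penalty that keeps the fine-tuned model's predictions on general prompts close to the base model's. Everything in it is an assumption made for illustration rather than the authors' implementation: the function names, the direction of the KL term, the `kl_weight` value, and the use of a single next-token distribution in place of full sequences.

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of the same length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def cross_entropy(logits, target_index):
    """Negative log-probability of the target token."""
    return -math.log(softmax(logits)[target_index])

def kl_regularized_loss(narrow_logits, narrow_target,
                        ft_general_logits, base_general_logits,
                        kl_weight=1.0):
    """Task loss on narrow data plus a KL penalty on general data.

    narrow_logits / narrow_target: fine-tuned model output and label on a
        narrow-dataset example (hypothetical inputs).
    ft_general_logits / base_general_logits: fine-tuned and base model
        outputs on the same general-domain prompt (hypothetical inputs).
    kl_weight: assumed hyperparameter trading off the two terms.
    """
    task = cross_entropy(narrow_logits, narrow_target)
    penalty = kl_divergence(softmax(ft_general_logits),
                            softmax(base_general_logits))
    return task + kl_weight * penalty
```

When the fine-tuned model matches the base model on general prompts, the penalty vanishes and only the narrow task loss remains; any drift on general prompts raises the total loss, which is what pushes training toward a narrow solution.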
Key Result
General misalignment solutions are consistently more stable and efficient than narrow solutions, achieving lower loss with smaller parameter norms and degrading more slowly under perturbations.
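One way to make "smaller parameter norms" and "degrading more slowly under perturbations" concrete is sketched below: compute the L2 norm of a parameter vector, then measure how much the loss rises when Gaussian noise of increasing scale is added to it. This is a toy sketch under stated assumptions, not the post's evaluation code. The isotropic Gaussian noise, the helper names, and the two quadratic loss functions are all hypothetical stand-ins.

```python
import math
import random

def l2_norm(params):
    """Euclidean norm of a flat parameter vector."""
    return math.sqrt(sum(p * p for p in params))

def mean_loss_under_noise(params, loss_fn, noise_scale, n_samples=200, seed=0):
    """Average loss after adding Gaussian noise to every parameter."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        noisy = [p + rng.gauss(0.0, noise_scale) for p in params]
        total += loss_fn(noisy)
    return total / n_samples

def degradation_curve(params, loss_fn, noise_scales, seed=0):
    """Loss increase over the unperturbed baseline at each noise scale."""
    baseline = loss_fn(params)
    return [mean_loss_under_noise(params, loss_fn, s, seed=seed) - baseline
            for s in noise_scales]

# Hypothetical stand-ins: a flat minimum and a sharp minimum at the origin.
def flat_loss(w):
    return 1.0 * sum(x * x for x in w)

def sharp_loss(w):
    return 10.0 * sum(x * x for x in w)

scales = [0.1, 0.5]
flat_curve = degradation_curve([0.0, 0.0], flat_loss, scales)
sharp_curve = degradation_curve([0.0, 0.0], sharp_loss, scales)
```

In this toy setup the flatter minimum loses less performance at every noise scale. That is the sense in which a solution can be called more stable under perturbation.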
Source
- Link: https://www.lesswrong.com/posts/gLDSqQm8pwNiq7qst/narrow-misalignment-is-hard-emergent-misalignment-is-easy
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- emergent-misalignment — Black-box safety (understand and control current model behaviour) / Model psychology