Narrow Misalignment is Hard, Emergent Misalignment is Easy

Edward Turner, Anna Soligo, Senthooran Rajamanoharan, Neel Nanda — 2024-07-14 — Google DeepMind — LessWrong

Summary

Investigates why fine-tuning on a narrow harmful dataset makes models generally misaligned. The authors use KL regularization to train narrowly misaligned models, then compare these with their generally misaligned counterparts, for both steering vectors and LoRA adapters.
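As a rough illustration of what a KL-regularized fine-tuning objective can look like, the sketch below adds a KL penalty on out-of-domain outputs to the usual cross-entropy on the narrow dataset. This is a minimal toy, not the authors' implementation: the function names, the `kl_weight` parameter, the direction of the KL term (fine-tuned relative to base), and the use of a single token's logits are all assumptions made for the example.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def cross_entropy(logits, target):
    """Negative log-likelihood of the target index."""
    return -log_softmax(logits)[target]

def kl_divergence(p_logits, q_logits):
    """KL(p || q) between two categorical distributions given as logits."""
    log_p = log_softmax(p_logits)
    log_q = log_softmax(q_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

def kl_regularized_loss(narrow_logits, narrow_target,
                        ft_general_logits, base_general_logits,
                        kl_weight=1.0):
    """Task loss on the narrow data plus a KL penalty on general data.

    Assumption: the penalty is KL(fine-tuned || base) on prompts outside
    the narrow domain, so the fine-tuned model is pushed to behave like
    the base model everywhere except on the narrow task.
    """
    task = cross_entropy(narrow_logits, narrow_target)
    penalty = kl_divergence(ft_general_logits, base_general_logits)
    return task + kl_weight * penalty
```

When the fine-tuned and base logits agree on the general prompt, the penalty vanishes and only the narrow task loss remains; any drift on general prompts raises the loss in proportion to `kl_weight`.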

Key Result

General misalignment solutions are consistently more efficient and more stable than narrow ones: they achieve lower loss with smaller parameter norms, and they degrade more slowly under perturbations.
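One simple way to probe stability under perturbation is to add Gaussian noise to a solution's parameters and track how the average loss grows. The sketch below does this on toy quadratic losses. It is only an assumed, generic procedure for illustration: the helper `mean_loss_under_noise`, the noise model, and the two toy loss functions are hypothetical and are not taken from the source.

```python
import random

def mean_loss_under_noise(loss_fn, params, sigma, n_samples=100, seed=0):
    """Average loss after adding N(0, sigma^2) noise to every parameter."""
    rng = random.Random(seed)  # fixed seed so comparisons share the same noise
    total = 0.0
    for _ in range(n_samples):
        noisy = [p + rng.gauss(0.0, sigma) for p in params]
        total += loss_fn(noisy)
    return total / n_samples

# Toy stand-ins: both losses have their minimum at the origin, but one
# basin is much flatter than the other.
def flat_loss(params):
    return 0.1 * sum(p * p for p in params)

def sharp_loss(params):
    return 10.0 * sum(p * p for p in params)

solution = [0.0, 0.0]
for sigma in (0.0, 0.1, 0.5):
    flat = mean_loss_under_noise(flat_loss, solution, sigma)
    sharp = mean_loss_under_noise(sharp_loss, solution, sigma)
    print(f"sigma={sigma}: flat={flat:.4f} sharp={sharp:.4f}")
```

In this toy setting the flatter basin plays the role of the more stable solution: its loss rises more slowly as the noise scale grows.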

Source