Narrow Misalignment is Hard, Emergent Misalignment is Easy

Edward Turner, Anna Soligo, Senthooran Rajamanoharan, Neel Nanda — 2025-07-14 — Google DeepMind — LessWrong

Summary

Investigates why models become generally misaligned when fine-tuned on narrow harmful datasets. The authors train narrowly misaligned models using KL regularization and compare them with their generally misaligned counterparts, for both steering vectors and LoRA adapters.
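
To make the KL regularization idea concrete, here is a minimal sketch of what such an objective could look like. It is an illustration under stated assumptions, not the authors' implementation. It assumes the penalty is a KL divergence between the fine-tuned and base model's next-token distributions on out-of-domain prompts, added to the usual cross-entropy loss on the narrow dataset. The function names, the `kl_weight` parameter, and the KL direction are all hypothetical choices made for this example.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target_index):
    """Negative log-probability of the target token."""
    return -math.log(softmax(logits)[target_index])

def kl_divergence(p_logits, q_logits):
    """KL(p || q) between two distributions given as logits."""
    p = softmax(p_logits)
    q = softmax(q_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def kl_regularized_loss(narrow_logits, narrow_target,
                        tuned_general_logits, base_general_logits,
                        kl_weight=1.0):
    """Task loss on narrow data plus a KL penalty on general data.

    Assumption: the penalty is KL(tuned || base). The direction and
    weighting used in the original work may differ.
    """
    task = cross_entropy(narrow_logits, narrow_target)
    penalty = kl_divergence(tuned_general_logits, base_general_logits)
    return task + kl_weight * penalty
```

In this sketch the penalty is zero when the fine-tuned model matches the base model on general prompts, so the optimizer is pushed to fit the narrow data while leaving out-of-domain behaviour unchanged.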

Key Result

General misalignment solutions are consistently more stable and efficient than narrow solutions, achieving lower loss with smaller parameter norms and degrading more slowly under perturbations.

Source

Link: https://www.lesswrong.com/posts/gLDSqQm8pwNiq7qst/narrow-misalignment-is-hard-emergent-misalignment-is-easy
Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- emergent-misalignment — Black-box safety (understand and control current model behaviour) / Model psychology
