Open problems in emergent misalignment

Jan Betley, Daniel Tan — 2025-03-01 — LessWrong

Summary

Proposes 30+ open research problems and follow-up directions building on the authors’ emergent misalignment paper, organized into six categories: training data variations, training process modifications, in-context learning, evaluation approaches, mechanistic interpretability, and non-misalignment generalizations.

Source