Open problems in emergent misalignment
Jan Betley, Daniel Tan — 2025-03-01 — LessWrong
Summary
Proposes 30+ open research problems and follow-up directions building on the authors’ emergent misalignment paper, organized into six categories: training data variations, training process modifications, in-context learning, evaluation approaches, mechanistic interpretability, and non-misalignment generalizations.
Source
- Link: https://lesswrong.com/posts/AcTEiu5wYDgrbmXow/open-problems-in-emergent-misalignment
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- emergent-misalignment — Black-box safety (understand and control current model behaviour) / Model psychology