Emergent misalignment — SR2025 Agenda Snapshot
One-sentence summary: Fine-tuning LLMs on one narrow antisocial task can cause general misalignment including deception, shutdown resistance, harmful advice, and extremist sympathies, when those behaviors are never trained or rewarded directly. A new agenda which quickly led to a stream of exciting work.
Theory of Change
Predict, detect, and prevent models from developing broadly harmful behaviors (like deception or shutdown resistance) when fine-tuned on seemingly unrelated tasks. Find, preserve, and robustify this correlated representation of the good.
Broad Approach
behaviorist science
Target Case
pessimistic
Orthodox Problems Addressed
Goals misgeneralize out of distribution, Superintelligence can fool human supervisors
Key People
Truthful AI, Jan Betley, James Chua, Mia Taylor, Miles Wang, Edward Turner, Anna Soligo, Alex Cloud, Nathan Hu, Owain Evans
Funding
Coefficient Giving, >$1 million
Estimated FTEs: 10-50
Critiques
Emergent Misalignment as Prompt Sensitivity, Go home GPT-4o, you’re drunk
See Also
auditing real models, pragmatic-interpretability
Outputs in 2025
17 item(s) in the review. See the wiki/summaries/ entries with frontmatter agenda: emergent-misalignment (these were generated alongside this file from the same export).
Source
- Row in
shallow-review-2025/agendas.csv(name = Emergent misalignment) — Shallow Review of Technical AI Safety 2025.
Related Pages
- ai-safety
- ai-safety
- pragmatic-interpretability
- assistance-games-assistive-agents
- black-box-make-ai-solve-it
- capability-removal-unlearning
- chain-of-thought-monitoring
- character-training-and-persona-steering
- control
- data-filtering
- data-poisoning-defense
- data-quality-for-alignment
- harm-reduction-for-open-weights
- hyperstition-studies
- inference-time-in-context-learning
- inference-time-steering
- inoculation-prompting
- iterative-alignment-at-post-train-time
- iterative-alignment-at-pretrain-time
- mild-optimisation
- model-psychopathology
- model-specs-and-constitutions
- model-values-model-preferences
- rl-safety
- safeguards-inference-time-auxiliaries
- synthetic-data-for-alignment
- the-neglected-approaches-approach
Sources cited
Primary URLs harvested from this page’s summary references. Auto-generated by scripts/backfill_citations.py; edit by re-running, not by hand.
- Summary: AI Safety (Wikipedia) — referenced as
[[ai-safety]]