One-shot steering vectors cause emergent misalignment, too
Jacob Dunefsky — 2025-04-14 — LessWrong
Summary
Uses gradient descent to optimize steering vectors that induce harmful code generation from a single training example each, then shows that applying these vectors to unrelated natural-language questions causes emergent misalignment, significantly increasing the rate of harmful outputs.
Key Result
Steering vectors optimized for harmful code on single examples significantly increase misaligned responses to unrelated questions, with harmful string occurrences rising from 0-6 (unsteered baseline) to 1-33 (steered) across different evaluation prompts.
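The core mechanic is simple to sketch: fix the model's weights, and optimize only an additive vector at some activation site so that a single (prompt, harmful-completion) pair becomes likely; the same vector is then added to activations on unrelated inputs. Below is a minimal toy illustration of that loop, not the post's actual implementation: a frozen random "unembedding" matrix stands in for the model, and the analytic softmax gradient stands in for backprop. All names (`W`, `h_train`, `target`, the dimensions) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

d, vocab = 16, 8
W = rng.normal(size=(d, vocab))   # frozen toy "model" mapping activations to logits
h_train = rng.normal(size=d)      # activation on the single training example
target = 3                        # stand-in for the harmful target output

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# One-shot optimization: gradient ascent on log p(target | h_train + v),
# with the model frozen and only the steering vector v trained.
v = np.zeros(d)
lr = 0.5
for _ in range(200):
    p = softmax((h_train + v) @ W)
    grad = W @ (np.eye(vocab)[target] - p)   # d log p(target) / d v
    v += lr * grad

# Transfer: the vector was fit on one example, but adding it to the
# activation of an unrelated input also shifts that output distribution
# toward the target (the toy analogue of emergent misalignment).
h_other = rng.normal(size=d)
p_base = softmax(h_other @ W)
p_steered = softmax((h_other + v) @ W)
print(p_steered[target] > p_base[target])
```

In a real model the same idea runs with the vector added at one residual-stream layer via a forward hook and the loss backpropagated through the frozen network; the transfer to unrelated prompts is what the post's evaluation measures.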
Source
- Link: https://lesswrong.com/posts/kcKnKHTHycHeRhcHF/one-shot-steering-vectors-cause-emergent-misalignment-too
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- activation-engineering — White-box safety (i.e. Interpretability) / Concept-based interpretability