One-shot steering vectors cause emergent misalignment, too
Jacob Dunefsky — 2025-04-14 — LessWrong
Summary
Uses gradient descent to optimize steering vectors that induce harmful code generation from a single training example each, then shows that applying these vectors to unrelated natural-language questions causes emergent misalignment, significantly increasing the rate of harmful outputs.
Key Result
Steering vectors optimized for harmful code on single examples significantly increase misaligned responses to unrelated questions, with harmful string occurrences rising from 0-6 (unsteered baseline) to 1-33 (steered) across different evaluation prompts.
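The core mechanic is simple to sketch: fix the model's weights, and optimize only an additive vector at some activation site so that a single (prompt, harmful-completion) pair becomes likely; the same vector is then added to activations on unrelated inputs. Below is a minimal toy illustration of that loop, not the post's actual implementation: a frozen random "unembedding" matrix stands in for the model, and the analytic softmax gradient stands in for backprop. All names (`W`, `h_train`, `target`, the dimensions) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

d, vocab = 16, 8
W = rng.normal(size=(d, vocab))   # frozen toy "model" mapping activations to logits
h_train = rng.normal(size=d)      # activation on the single training example
target = 3                        # stand-in for the harmful target output

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# One-shot optimization: gradient ascent on log p(target | h_train + v),
# with the model frozen and only the steering vector v trained.
v = np.zeros(d)
lr = 0.5
for _ in range(200):
    p = softmax((h_train + v) @ W)
    grad = W @ (np.eye(vocab)[target] - p)   # d log p(target) / d v
    v += lr * grad

# Transfer: the vector was fit on one example, but adding it to the
# activation of an unrelated input also shifts that output distribution
# toward the target (the toy analogue of emergent misalignment).
h_other = rng.normal(size=d)
p_base = softmax(h_other @ W)
p_steered = softmax((h_other + v) @ W)
print(p_steered[target] > p_base[target])
```

In a real model the same idea runs with the vector added at one residual-stream layer via a forward hook and the loss backpropagated through the frozen network; the transfer to unrelated prompts is what the post's evaluation measures.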
Source
- Link: https://lesswrong.com/posts/kcKnKHTHycHeRhcHF/one-shot-steering-vectors-cause-emergent-misalignment-too
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- activation-engineering — White-box safety (i.e. Interpretability) / Concept-based interpretability