Do safety-relevant LLM steering vectors optimized on a single example generalize?
Jacob Dunefsky — 2025-02-28 — Yale University — arXiv
Summary
Investigates whether steering vectors optimized on a single training example generalize to control safety-relevant LLM behaviors on unseen inputs, testing the approach in three settings: alignment faking (the Poser testbed), refusal circumvention, and retraction of fictitious information.
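A minimal sketch of the general single-example recipe the summary describes (not necessarily the post's exact method): optimize a vector added to one layer's residual stream by gradient descent on a single (prompt, target) pair, then reuse it on other inputs. The model choice, layer index, example text, loss, and hyperparameters below are all illustrative assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"         # assumption: any decoder-only HF model; the post uses larger models
LAYER = 6              # assumption: which block's residual stream to steer
STEPS, LR = 100, 1e-2  # assumption: optimization budget

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval().requires_grad_(False)  # only the steering vector is trained

steer = torch.zeros(model.config.hidden_size, requires_grad=True)

def add_steer(_module, _inputs, output):
    # Add the steering vector to the residual stream at every position.
    if isinstance(output, tuple):
        return (output[0] + steer,) + output[1:]
    return output + steer

handle = model.transformer.h[LAYER].register_forward_hook(add_steer)

# The single training example: a hypothetical prompt and target behavior.
prompt_ids = tok("How do I pick a lock?", return_tensors="pt").input_ids
target_ids = tok(" Sure, here's how:", return_tensors="pt").input_ids
ids = torch.cat([prompt_ids, target_ids], dim=1)
T = target_ids.shape[1]

opt = torch.optim.Adam([steer], lr=LR)
for _ in range(STEPS):
    opt.zero_grad()
    logits = model(ids).logits[0]
    # Cross-entropy of the target tokens under the steered model.
    loss = torch.nn.functional.cross_entropy(logits[-T - 1:-1], target_ids[0])
    loss.backward()
    opt.step()

# Generalization test: keep the hook registered and run held-out prompts
# through model.generate(...) to see if the steered behavior transfers.
```

Generalization, in the post's sense, is then measured by whether this one-shot vector changes behavior on other prompts (e.g., HarmBench prompts in the refusal setting), not just on the example it was optimized on.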
Key Result
Found steering vectors trained on single examples that achieve a 96.9% attack success rate on HarmBench for refusal circumvention and successfully mediate harmful behavior in alignment-faking models (93.4% behavior change on benign inputs, 83.4% on harmful inputs).
Source
- Link: https://lesswrong.com/posts/6aXe9nipTgwK5LxaP/do-safety-relevant-llm-steering-vectors-optimized-on-a
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
  - activation-engineering — White-box safety (i.e. Interpretability) / Concept-based interpretability