Do safety-relevant LLM steering vectors optimized on a single example generalize?
Jacob Dunefsky — 2025-02-28 — Yale University — arXiv
Summary
Investigates whether steering vectors optimized on a single training example generalize to control safety-relevant LLM behaviors on unseen inputs, testing the approach in three settings: alignment faking (the Poser testbed), refusal circumvention, and retraction of fictitious information.
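A minimal sketch of the general single-example recipe the summary describes (not necessarily the post's exact method): optimize a vector added to one layer's residual stream by gradient descent on a single (prompt, target) pair, then reuse it on other inputs. The model choice, layer index, example text, loss, and hyperparameters below are all illustrative assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"         # assumption: any decoder-only HF model; the post uses larger models
LAYER = 6              # assumption: which block's residual stream to steer
STEPS, LR = 100, 1e-2  # assumption: optimization budget

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval().requires_grad_(False)  # only the steering vector is trained

steer = torch.zeros(model.config.hidden_size, requires_grad=True)

def add_steer(_module, _inputs, output):
    # Add the steering vector to the residual stream at every position.
    if isinstance(output, tuple):
        return (output[0] + steer,) + output[1:]
    return output + steer

handle = model.transformer.h[LAYER].register_forward_hook(add_steer)

# The single training example: a hypothetical prompt and target behavior.
prompt_ids = tok("How do I pick a lock?", return_tensors="pt").input_ids
target_ids = tok(" Sure, here's how:", return_tensors="pt").input_ids
ids = torch.cat([prompt_ids, target_ids], dim=1)
T = target_ids.shape[1]

opt = torch.optim.Adam([steer], lr=LR)
for _ in range(STEPS):
    opt.zero_grad()
    logits = model(ids).logits[0]
    # Cross-entropy of the target tokens under the steered model.
    loss = torch.nn.functional.cross_entropy(logits[-T - 1:-1], target_ids[0])
    loss.backward()
    opt.step()

# Generalization test: keep the hook registered and run held-out prompts
# through model.generate(...) to see if the steered behavior transfers.
```

Generalization, in the post's sense, is then measured by whether this one-shot vector changes behavior on other prompts (e.g., HarmBench prompts in the refusal setting), not just on the example it was optimized on.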
Key Result
Found steering vectors trained on single examples that achieve a 96.9% attack success rate on HarmBench for refusal circumvention and successfully mediate harmful behavior in alignment-faking models (93.4% behavior change on benign inputs, 83.4% on harmful inputs).
Source
- Link: https://lesswrong.com/posts/6aXe9nipTgwK5LxaP/do-safety-relevant-llm-steering-vectors-optimized-on-a
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
  - activation-engineering — White-box safety (i.e. Interpretability) / Concept-based interpretability