Do safety-relevant LLM steering vectors optimized on a single example generalize?

Jacob Dunefsky — 2025-02-28 — Yale University — arXiv

Summary

Investigates whether steering vectors optimized on a single training example generalize to control safety-relevant LLM behaviors across many inputs, testing the approach in three settings: alignment faking (the Poser testbed), refusal circumvention, and retraction of fictitious information.
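To make the setup concrete, here is a toy numpy sketch of the idea: optimize an additive steering vector by gradient descent on a single example, then check whether it transfers to a held-out input. This is not the paper's implementation (which steers a transformer's internal activations); the linear "model", dimension, and hyperparameters below are illustrative assumptions.

```python
import numpy as np

# Toy sketch (NOT the paper's method): the "model" is a fixed linear readout
# W over a hidden state h. We learn a steering vector v, added to h, that
# pushes probability mass from a "refuse" logit to a "comply" logit, using
# gradient descent on ONE training example only.

rng = np.random.default_rng(0)
d = 16                            # hidden dimension (arbitrary toy choice)
W = rng.normal(size=(2, d))       # row 0 = "comply" logit, row 1 = "refuse"

def logits(h, v):
    return W @ (h + v)            # steering = additive intervention on h

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

h_train = rng.normal(size=d)      # the single training example
h_test = rng.normal(size=d)       # a held-out input for the transfer check

v = np.zeros(d)
target = np.array([1.0, 0.0])     # all probability mass on "comply"
for _ in range(200):
    p = softmax(logits(h_train, v))
    grad = (p - target) @ W       # gradient of cross-entropy w.r.t. v
    v -= 0.1 * grad

p_train = softmax(logits(h_train, v))          # steered, training input
p_test = softmax(logits(h_test, v))            # steered, held-out input
p_test_base = softmax(logits(h_test, np.zeros(d)))  # unsteered baseline
print(p_train[0], p_test[0], p_test_base[0])
```

In this linear toy the learned vector ends up aligned with the "comply minus refuse" readout direction, so it trivially shifts behavior on any input; the paper's question is whether the analogous single-example vector transfers in real transformers for safety-relevant behaviors.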

Key Result

Found steering vectors trained on single examples that achieve a 96.9% attack success rate on HarmBench for refusal circumvention and that successfully mediate harmful behavior in alignment-faking models (93.4% behavior change on benign inputs, 83.4% on harmful inputs).

Source