Improving Steering Vectors by Targeting Sparse Autoencoder Features
Sviatoslav Chalnev, Matthew Siu, Arthur Conmy — 2024-11-04 — arXiv
Summary
Develops SAE-Targeted Steering (SAE-TS), a method that uses sparse autoencoders (SAEs) to measure the effects of steering vectors and to optimize them for a targeted effect while minimizing unintended side effects. It compares favorably to existing methods such as Contrastive Activation Addition (CAA) and direct SAE feature steering.
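The idea of measuring a steering vector's effect on SAE features and then choosing a vector for a targeted effect can be illustrated with a toy sketch. This is not the paper's implementation: it assumes, for illustration only, that the change in SAE feature activations is roughly linear in the steering vector, and it uses synthetic data in place of a real model and SAE. All function names (fit_effect_model, targeted_vector, baseline_vector, selectivity) are hypothetical.

```python
import numpy as np

# Illustrative only: a linear stand-in for "steering vector -> change in SAE
# feature activations". Names and data are hypothetical, not from the paper.

def fit_effect_model(steer_vecs, feature_effects):
    """Least-squares fit of feature_effects ~= steer_vecs @ M + b."""
    ones = np.ones((steer_vecs.shape[0], 1))
    design = np.hstack([steer_vecs, ones])
    W, *_ = np.linalg.lstsq(design, feature_effects, rcond=None)
    return W[:-1], W[-1]  # M: (d_model, n_features), b: (n_features,)

def targeted_vector(M, target, ridge=1e-3):
    """Unit vector whose predicted effect is concentrated on `target`.

    Solves min_s ||s @ M - one_hot(target)||^2 + ridge * ||s||^2.
    """
    d_model, n_features = M.shape
    want = np.zeros(n_features)
    want[target] = 1.0
    s = np.linalg.solve(M @ M.T + ridge * np.eye(d_model), M @ want)
    return s / np.linalg.norm(s)

def baseline_vector(M, target):
    """Unit vector that maximizes the target effect and ignores side effects."""
    s = M[:, target]
    return s / np.linalg.norm(s)

def selectivity(M, s, target):
    """Share of the predicted squared effect that lands on the target feature."""
    effect = s @ M
    return effect[target] ** 2 / np.sum(effect ** 2)

# Synthetic "measurements": random steering vectors and their feature effects.
rng = np.random.default_rng(0)
d_model, n_features, n_samples = 8, 20, 200
true_M = rng.normal(size=(d_model, n_features))
X = rng.normal(size=(n_samples, d_model))
Y = X @ true_M + 0.5 + 0.01 * rng.normal(size=(n_samples, n_features))

M, b = fit_effect_model(X, Y)
target = 3
print("targeted:", selectivity(M, targeted_vector(M, target), target))
print("baseline:", selectivity(M, baseline_vector(M, target), target))
```

In this toy setting the targeted vector puts a larger share of its predicted effect on the chosen feature than the baseline does, which is the kind of trade-off between steering strength and side effects that the summary describes.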
Key Result
SAE-TS balances steering effects with coherence better than CAA and SAE feature steering when evaluated on a range of tasks.
Source
- Link: https://arxiv.org/abs/2411.02193
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- activation-engineering — White-box safety (i.e. Interpretability) / Concept-based interpretability