Improving Steering Vectors by Targeting Sparse Autoencoder Features

Sviatoslav Chalnev, Matthew Siu, Arthur Conmy — 2024-11-04 — arXiv

Summary

Introduces SAE-Targeted Steering (SAE-TS), a method that uses sparse autoencoders (SAEs) to measure the effects of steering vectors and to construct vectors that produce a targeted effect while minimizing unintended side effects. SAE-TS compares favorably to existing methods such as Contrastive Activation Addition (CAA) and direct SAE feature steering.
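
The measurement idea in the summary can be illustrated with a toy sketch. This is not the paper's implementation: it assumes a randomly initialized ReLU encoder standing in for a trained SAE, random vectors standing in for model activations, and hypothetical helper names (`sae_encode`, `steering_effect`, `split_effect`). It shows only how adding a steering vector changes SAE feature activations, and how that change might be split into an on-target part and an off-target (side effect) part.

```python
import numpy as np

def sae_encode(x, W_enc, b_enc):
    """Toy SAE encoder (assumed form): ReLU(x @ W_enc + b_enc)."""
    return np.maximum(x @ W_enc + b_enc, 0.0)

def steering_effect(acts, steer, W_enc, b_enc):
    """Mean change in each SAE feature when `steer` is added to every activation."""
    base = sae_encode(acts, W_enc, b_enc)
    steered = sae_encode(acts + steer, W_enc, b_enc)
    return (steered - base).mean(axis=0)

def split_effect(effect, target):
    """Return (change on the target feature, L2 norm of change on all other features)."""
    others = np.delete(effect, target)
    return float(effect[target]), float(np.linalg.norm(others))

# Synthetic stand-ins: a real setup would use a trained SAE and model activations.
rng = np.random.default_rng(0)
d_model, n_features = 16, 64
W_enc = rng.normal(size=(d_model, n_features)) / np.sqrt(d_model)
b_enc = np.zeros(n_features)
acts = rng.normal(size=(256, d_model))

# Naive baseline: steer along the encoder direction of the target feature.
target = 3
direction = W_enc[:, target] / np.linalg.norm(W_enc[:, target])
steer = 4.0 * direction

effect = steering_effect(acts, steer, W_enc, b_enc)
on_target, off_target = split_effect(effect, target)
print(f"on-target change: {on_target:.3f}, off-target norm: {off_target:.3f}")
```

In this framing, a steering vector is preferable when it gives a large on-target change for a small off-target norm; how SAE-TS actually optimizes for that trade-off is described in the paper itself.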

Key Result

SAE-TS balances steering effects with coherence better than CAA and SAE feature steering when evaluated on a range of tasks.

Source