Steering Large Language Model Activations in Sparse Spaces
Reza Bayat, Ali Rahimi-Kalahroudi, Mohammad Pezeshki, Sarath Chandar, Pascal Vincent — 2025-02-28 — arXiv
Summary
Introduces sparse activation steering (SAS), a method that leverages sparse autoencoders to steer LLM behavior in interpretable sparse feature spaces, enabling more precise behavioral control than dense activation steering approaches.
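The general idea of steering in an SAE's feature space rather than in the dense activation space can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's procedure: the SAE is a toy ReLU autoencoder with random weights, the names `sae_encode`, `sae_decode`, and `steer_sparse` are hypothetical, and both the zero clamp and the carried-over reconstruction error are choices made for this sketch.

```python
import numpy as np

def sae_encode(x, W_enc, b_enc):
    # Toy SAE encoder: sparse, non-negative features via ReLU.
    return np.maximum(x @ W_enc + b_enc, 0.0)

def sae_decode(z, W_dec, b_dec):
    # Toy SAE decoder: linear map back to the dense activation space.
    return z @ W_dec + b_dec

def steer_sparse(x, steer, strength, W_enc, b_enc, W_dec, b_dec):
    z = sae_encode(x, W_enc, b_enc)
    # Assumption: carry over the SAE reconstruction error so that
    # strength == 0 leaves the activation unchanged.
    residual = x - sae_decode(z, W_dec, b_dec)
    # Assumption: clamp at zero so steered features stay non-negative.
    z_steered = np.maximum(z + strength * steer, 0.0)
    return sae_decode(z_steered, W_dec, b_dec) + residual

# Random stand-in weights; a real setup would load a trained SAE.
rng = np.random.default_rng(0)
d_model, d_sae = 8, 32
W_enc = rng.normal(size=(d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(size=(d_sae, d_model))
b_dec = np.zeros(d_model)

x = rng.normal(size=d_model)
steer = np.zeros(d_sae)
steer[3] = 1.0  # hypothetical feature index chosen for illustration
x_steered = steer_sparse(x, steer, 4.0, W_enc, b_enc, W_dec, b_dec)
```

Because only one sparse feature is boosted here, the dense activation moves along that feature's decoder row alone, which is the sense in which a sparse steering vector is easier to read than a dense one.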
Key Result
On Gemma 2 LLMs, SAS enables nuanced behavioral modulation with finer-grained control than dense steering. Scaling the SAEs improves the monosemanticity of the steering vectors, supporting more reliable interventions.
Source
- Link: https://arxiv.org/abs/2503.00177
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- activation-engineering — White-box safety (i.e. Interpretability) / Concept-based interpretability