Steering Large Language Model Activations in Sparse Spaces

Reza Bayat, Ali Rahimi-Kalahroudi, Mohammad Pezeshki, Sarath Chandar, Pascal Vincent — 2025-02-28 — arXiv

Summary

Introduces sparse activation steering (SAS), a method that uses sparse autoencoders (SAEs) to steer LLM behavior in an interpretable sparse feature space rather than directly in the dense activation space, enabling more precise behavioral control than dense activation steering approaches.
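The general idea of steering in a sparse feature space can be illustrated with a minimal sketch. It is not the authors' implementation: the toy autoencoder weights are random and untrained, and the names `encode`, `decode`, `steer`, `strength`, and the sizes `d_model` and `d_sae` are hypothetical choices made for illustration. The sketch assumes a standard ReLU autoencoder and assumes that only the change in the reconstruction is added back to the activation, so that the autoencoder's reconstruction error does not alter the unsteered part.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 8, 32  # toy sizes (hypothetical); d_sae > d_model

# Hypothetical toy SAE parameters: random and untrained, for illustration only.
W_enc = rng.normal(size=(d_sae, d_model)) / np.sqrt(d_model)
b_enc = np.zeros(d_sae)
W_dec = rng.normal(size=(d_model, d_sae)) / np.sqrt(d_sae)
b_dec = np.zeros(d_model)


def encode(x):
    """Map a dense activation to non-negative sparse-space features."""
    return np.maximum(0.0, W_enc @ (x - b_dec) + b_enc)


def decode(f):
    """Map sparse-space features back to the dense activation space."""
    return W_dec @ f + b_dec


def steer(x, v, strength):
    """Shift the features of x along v, then apply the change to x.

    Assumption: only the difference between the steered and unsteered
    reconstructions is added, so strength = 0 leaves x unchanged.
    """
    f = encode(x)
    f_steered = np.maximum(0.0, f + strength * v)  # keep features non-negative
    return x + decode(f_steered) - decode(f)


x = rng.normal(size=d_model)   # stand-in for one activation vector
v = np.zeros(d_sae)
v[3] = 1.0                     # steer along a single (arbitrary) feature
x_new = steer(x, v, strength=4.0)
```

Because `v` selects a single feature here, the resulting change to the activation is `strength` times that feature's decoder column, which is what makes an intervention in the sparse space easy to attribute to one feature.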

Key Result

SAS enables nuanced behavioral modulation on Gemma 2 LLMs with finer-grained control than dense steering. Scaling up the sparse autoencoders also makes the steering vectors more monosemantic, which suggests more reliable interventions.

Source