Steering Large Language Model Activations in Sparse Spaces

Reza Bayat, Ali Rahimi-Kalahroudi, Mohammad Pezeshki, Sarath Chandar, Pascal Vincent — 2025-02-28 — arXiv

Summary

Introduces sparse activation steering (SAS), a method that uses sparse autoencoders (SAEs) to steer LLM behavior in an interpretable sparse feature space, enabling more precise behavioral control than dense activation steering.
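The general idea of steering in a sparse space can be illustrated with a minimal sketch. This is not the authors' implementation: `ToySAE`, `sparse_steer`, the random weights, the ReLU applied after steering, and the step that adds the reconstruction error back are all assumptions made for illustration. The sketch encodes a dense activation into sparse features, shifts those features along a steering vector, and decodes back to the dense space.

```python
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def relu(v):
    return [max(0.0, a) for a in v]


class ToySAE:
    """Hypothetical ReLU sparse autoencoder with random weights (illustration only)."""

    def __init__(self, d_model, d_sparse, seed=0):
        rng = random.Random(seed)
        # Encoder maps d_model -> d_sparse, decoder maps d_sparse -> d_model.
        self.W_enc = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(d_sparse)]
        self.W_dec = [[rng.gauss(0, 1) for _ in range(d_sparse)] for _ in range(d_model)]

    def encode(self, h):
        return relu(matvec(self.W_enc, h))

    def decode(self, z):
        return matvec(self.W_dec, z)


def sparse_steer(sae, h, steer, strength):
    """Steer a dense activation h by editing its sparse code.

    Assumption: the SAE reconstruction error is added back so that
    a zero steering vector leaves h unchanged.
    """
    z = sae.encode(h)
    residual = [a - b for a, b in zip(h, sae.decode(z))]
    # Shift the sparse features, keeping them non-negative.
    z_steered = relu([a + strength * s for a, s in zip(z, steer)])
    return [a + b for a, b in zip(sae.decode(z_steered), residual)]


# Example: push a single (hypothetical) feature, index 3, upward.
sae = ToySAE(d_model=4, d_sparse=16)
h = [0.5, -1.0, 0.25, 2.0]
steer = [0.0] * 16
steer[3] = 1.0
h_steered = sparse_steer(sae, h, steer, strength=2.0)
```

In this sketch a steering strength of 0 returns the input activation up to floating-point error, and a positive push on one feature moves the activation along that feature's decoder direction.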

Key Result

SAS enables nuanced behavioral modulation on Gemma 2 LLMs with finer-grained control than dense steering. Scaling up the sparse autoencoders improves the monosemanticity of the steering vectors, which makes interventions more reliable.

Source

Link: https://arxiv.org/abs/2503.00177
Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- activation-engineering — White-box safety (i.e. Interpretability) / Concept-based interpretability
