Scaling Sparse Feature Circuit Finding to Gemma 9B
Diego Caples, Jatin Nainani, Callum McDougall, rrenaud — 2025-01-10 — MATS Program — LessWrong
Summary
Develops a scalable circuit-finding method for large language models by placing SAEs only at intervals along the residual stream and learning binary masks over SAE latents via continuous sparsification to identify minimal circuits, demonstrating the approach on Gemma 9B across multiple tasks.
Key Result
Learned binary masking achieves 95% faithfulness with fewer than 20 total latents on code-output prediction tasks, significantly outperforming integrated gradients, and enables discovery of model vulnerabilities via the resulting mechanistic understanding.
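The core technique (a per-latent binary mask relaxed with a temperature-annealed sigmoid, plus a sparsity penalty) can be illustrated with a toy sketch. Everything below is a minimal illustration, not the post's implementation: the "latents", linear downstream output, hyperparameters, and annealing schedule are all assumed for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy stand-in for the circuit-finding setup: 8 "SAE latents", of which
# only latents 0 and 3 actually drive a linear downstream output.
rng = np.random.default_rng(0)
n_latents = 8
weights = np.zeros(n_latents)
weights[[0, 3]] = 1.0                  # ground-truth "circuit"
x = rng.normal(size=(64, n_latents))   # latent activations (batch of 64)

s = np.zeros(n_latents)   # learned mask logits
beta = 1.0                # sigmoid temperature, annealed upward
lam = 0.05                # sparsity penalty on the soft mask
lr = 0.2

for _ in range(300):
    m = sigmoid(beta * s)
    # Faithfulness term: squared error between the full output and the
    # output with mask-ablated latents, r_i = sum_j x_ij w_j (1 - m_j).
    r = x @ (weights * (1.0 - m))
    grad_m = (2.0 * r[:, None] * (-x * weights)).mean(axis=0)
    # Add the L0 surrogate's gradient (d/dm of lam * sum(m) is lam),
    # then chain through m = sigmoid(beta * s).
    grad_s = (grad_m + lam) * beta * m * (1.0 - m)
    s -= lr * grad_s
    beta *= 1.02          # continuous sparsification: sharpen the sigmoid

hard_mask = sigmoid(beta * s) > 0.5   # binarize once annealing is done
print(np.where(hard_mask)[0])         # latents kept in the circuit
```

As beta grows, each soft mask entry is pushed toward 0 or 1: the sparsity term drives inert latents off, while the faithfulness term keeps task-relevant latents on, so the final hard mask recovers the two ground-truth latents. In the actual setting the faithfulness term would compare the model's outputs with masked versus unmasked SAE reconstructions rather than a hand-built linear target.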
Source
- Link: https://lesswrong.com/posts/PkeB4TLxgaNnSmddg/scaling-sparse-feature-circuit-finding-to-gemma-9b
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- sparse-coding — White-box safety (i.e. Interpretability)