Scaling Sparse Feature Circuit Finding to Gemma 9B
Diego Caples, Jatin Nainani, Callum McDougall, rrenaud — 2025-01-10 — MATS Program — LessWrong
Summary
Develops a scalable circuit-finding method for large language models by placing SAEs only at intervals along the residual stream and learning binary masks over SAE latents via continuous sparsification to identify minimal circuits, demonstrating the approach on Gemma 9B across multiple tasks.
Key Result
Learned binary masking achieves 95% faithfulness with fewer than 20 total latents on code-output prediction tasks, significantly outperforming integrated gradients, and enables discovery of model vulnerabilities via the resulting mechanistic understanding.
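The core technique (a per-latent binary mask relaxed with a temperature-annealed sigmoid, plus a sparsity penalty) can be illustrated with a toy sketch. Everything below is a minimal illustration, not the post's implementation: the "latents", linear downstream output, hyperparameters, and annealing schedule are all assumed for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy stand-in for the circuit-finding setup: 8 "SAE latents", of which
# only latents 0 and 3 actually drive a linear downstream output.
rng = np.random.default_rng(0)
n_latents = 8
weights = np.zeros(n_latents)
weights[[0, 3]] = 1.0                  # ground-truth "circuit"
x = rng.normal(size=(64, n_latents))   # latent activations (batch of 64)

s = np.zeros(n_latents)   # learned mask logits
beta = 1.0                # sigmoid temperature, annealed upward
lam = 0.05                # sparsity penalty on the soft mask
lr = 0.2

for _ in range(300):
    m = sigmoid(beta * s)
    # Faithfulness term: squared error between the full output and the
    # output with mask-ablated latents, r_i = sum_j x_ij w_j (1 - m_j).
    r = x @ (weights * (1.0 - m))
    grad_m = (2.0 * r[:, None] * (-x * weights)).mean(axis=0)
    # Add the L0 surrogate's gradient (d/dm of lam * sum(m) is lam),
    # then chain through m = sigmoid(beta * s).
    grad_s = (grad_m + lam) * beta * m * (1.0 - m)
    s -= lr * grad_s
    beta *= 1.02          # continuous sparsification: sharpen the sigmoid

hard_mask = sigmoid(beta * s) > 0.5   # binarize once annealing is done
print(np.where(hard_mask)[0])         # latents kept in the circuit
```

As beta grows, each soft mask entry is pushed toward 0 or 1: the sparsity term drives inert latents off, while the faithfulness term keeps task-relevant latents on, so the final hard mask recovers the two ground-truth latents. In the actual setting the faithfulness term would compare the model's outputs with masked versus unmasked SAE reconstructions rather than a hand-built linear target.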
Source
- Link: https://lesswrong.com/posts/PkeB4TLxgaNnSmddg/scaling-sparse-feature-circuit-finding-to-gemma-9b
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- sparse-coding — White-box safety (i.e. Interpretability)