SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability
Adam Karvonen, Can Rager, Johnny Lin, Curt Tigges, Joseph Bloom, David Chanin, … (+9 more) — 2025-06-04 — arXiv (accepted to ICML 2025)
Summary
Introduces SAEBench, a comprehensive benchmark that evaluates sparse autoencoders (SAEs) across eight metrics spanning interpretability, feature disentanglement, and practical applications. Open-sources more than 200 SAEs across eight architectures and shows that gains on existing proxy metrics do not reliably predict practical performance.
Key Result
Gains on existing proxy metrics do not reliably translate to better practical performance; Matryoshka SAEs substantially outperform other architectures on feature disentanglement despite underperforming on proxy metrics.
Source
- Link: https://arxiv.org/abs/2503.09532
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- sparse-coding — White-box safety (i.e. Interpretability)