SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability

Adam Karvonen, Can Rager, Johnny Lin, Curt Tigges, Joseph Bloom, David Chanin, … (+9 more) — 2025-06-04 — arXiv (accepted to ICML 2025)

Summary

Introduces SAEBench, a comprehensive benchmark that evaluates sparse autoencoders (SAEs) across eight metrics spanning interpretability, feature disentanglement, and practical applications. The authors open-source 200+ trained SAEs spanning eight architectures and show that commonly used proxy metrics do not reliably predict practical performance.

Key Result

Gains on existing proxy metrics do not reliably translate to better practical performance; Matryoshka SAEs substantially outperform other architectures on feature disentanglement despite underperforming on proxy metrics.
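To make the proxy-metric discussion concrete, here is a minimal sketch (not SAEBench's actual code, and with hypothetical sizes) of the kind of sparse autoencoder these benchmarks evaluate: model activations are encoded into a wider, sparse dictionary and decoded back, and quantities like average L0 sparsity and reconstruction error serve as the standard proxy metrics the paper argues against relying on alone.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 16, 64  # hypothetical dimensions for illustration

# Randomly initialized weights stand in for trained parameters.
W_enc = rng.normal(scale=0.1, size=(d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(scale=0.1, size=(d_sae, d_model))
b_dec = np.zeros(d_model)

def sae_forward(x):
    """Encode activations into sparse features, then decode linearly."""
    f = np.maximum(x @ W_enc + b_enc, 0.0)  # ReLU gives nonnegative, sparse codes
    x_hat = f @ W_dec + b_dec               # linear reconstruction
    return f, x_hat

x = rng.normal(size=(8, d_model))  # a batch of (fake) model activations
f, x_hat = sae_forward(x)

# Typical proxy metrics: sparsity (mean L0 per example) and reconstruction MSE.
l0 = (f > 0).sum(axis=1).mean()
mse = ((x - x_hat) ** 2).mean()
```

SAEBench's point is that optimizing such proxies (low L0, low MSE) does not guarantee gains on downstream measures like feature disentanglement, which is why the benchmark evaluates practical tasks directly.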

Source