SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability
Adam Karvonen, Can Rager, Johnny Lin, Curt Tigges, Joseph Bloom, David Chanin, … (+9 more) — 2025-06-04 — arXiv (accepted to ICML 2025)
Summary
Introduces SAEBench, a comprehensive benchmark that evaluates sparse autoencoders (SAEs) across eight metrics spanning interpretability, feature disentanglement, and practical applications. Open-sources more than 200 SAEs across eight architectures and shows that gains on existing proxy metrics do not reliably predict practical performance.
Key Result
Gains on existing proxy metrics do not reliably translate to better practical performance; Matryoshka SAEs substantially outperform other architectures on feature disentanglement despite underperforming on proxy metrics.
Source
- Link: https://arxiv.org/abs/2503.09532
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- sparse-coding — White-box safety (i.e. Interpretability)