Transcoders Beat Sparse Autoencoders for Interpretability
Gonçalo Paulo, Stepan Shabalin, Nora Belrose — 2025-01-31 — arXiv
Summary
Empirically compares transcoders with sparse autoencoders (SAEs) for neural network interpretability. Transcoders learn a sparse map from an MLP sublayer's input to its output, rather than reconstructing activations in place as SAEs do; the paper finds they produce more interpretable features, and proposes skip transcoders, which add a linear skip connection to improve reconstruction loss without sacrificing interpretability.
Key Result
Transcoder features are significantly more interpretable than SAE features when both are trained on the same model and data, and skip transcoders achieve lower reconstruction loss with no loss of interpretability.
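The skip-transcoder idea above can be sketched in a few lines: a sparse bottleneck predicts an MLP's output from its input, and a learned linear skip term is added to the decoder output. This is a minimal NumPy sketch, assuming a TopK-style sparse activation; all parameter names and dimensions are illustrative, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_hidden, k = 16, 64, 8  # illustrative sizes

# Hypothetical parameters: encoder, decoder, and the linear skip path
W_enc = rng.normal(size=(d_model, d_hidden)) / np.sqrt(d_model)
b_enc = np.zeros(d_hidden)
W_dec = rng.normal(size=(d_hidden, d_model)) / np.sqrt(d_hidden)
b_dec = np.zeros(d_model)
W_skip = rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)

def topk_relu(z, k):
    """Keep the k largest positive pre-activations, zero the rest."""
    pos = np.where(z > 0, z, 0.0)
    thresh = np.sort(pos)[-k]
    return np.where(pos >= thresh, pos, 0.0)

def skip_transcoder(x):
    """Predict the MLP's output from its input via a sparse bottleneck
    plus a learned linear skip term (the skip-transcoder idea)."""
    f = topk_relu(x @ W_enc + b_enc, k)      # sparse feature activations
    return f @ W_dec + b_dec + x @ W_skip    # decode + skip path

x = rng.normal(size=d_model)
y_hat = skip_transcoder(x)
```

In training, the parameters would be fit to minimize the squared error between `y_hat` and the MLP's actual output; the skip path carries the easily linear part of the map, freeing the sparse features to capture the rest.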
Source
- Link: https://arxiv.org/abs/2501.18823
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- sparse-coding — White-box safety (i.e. Interpretability)