Transcoders Beat Sparse Autoencoders for Interpretability

Gonçalo Paulo, Stepan Shabalin, Nora Belrose — 2025-01-31 — arXiv

Summary

Empirically compares transcoders and sparse autoencoders (SAEs) for neural network interpretability, finding that transcoders produce more interpretable features, and proposes skip transcoders, which reduce reconstruction loss without sacrificing interpretability.

Key Result

Transcoder features are significantly more interpretable than SAE features when trained on the same model and data, and skip transcoders achieve lower reconstruction loss with no measurable loss of interpretability.
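The architectural difference can be sketched in a few lines. An SAE reconstructs an activation from its own sparse code; a transcoder instead uses a sparse code of the MLP input to predict the MLP output; a skip transcoder adds a learned linear skip path on top. The parameter names, shapes, and top-k sparsity below are illustrative assumptions, not the paper's implementation; this is a minimal NumPy sketch with random weights:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, d_feat, k = 16, 16, 64, 4  # hypothetical sizes

# Encoder, decoder, and skip weights (randomly initialized for illustration)
W_enc = rng.normal(size=(d_in, d_feat)) / np.sqrt(d_in)
W_dec = rng.normal(size=(d_feat, d_out)) / np.sqrt(d_feat)
W_skip = rng.normal(size=(d_in, d_out)) / np.sqrt(d_in)  # extra linear path

def topk_sparse(h, k):
    # keep the k largest activations per sample, zero out the rest
    out = np.zeros_like(h)
    idx = np.argsort(h, axis=-1)[..., -k:]
    np.put_along_axis(out, idx, np.take_along_axis(h, idx, axis=-1), axis=-1)
    return out

def sae(x):
    # SAE baseline: sparse features reconstruct the input activation itself
    f = topk_sparse(np.maximum(x @ W_enc, 0.0), k)
    return f @ W_dec  # target is x (here d_out == d_in)

def transcoder(x):
    # transcoder: sparse features of the MLP *input* predict the MLP *output*
    f = topk_sparse(np.maximum(x @ W_enc, 0.0), k)
    return f @ W_dec

def skip_transcoder(x):
    # same sparse path plus a learned linear skip term, x @ W_skip
    return transcoder(x) + x @ W_skip

x = rng.normal(size=(8, d_in))
y_hat = skip_transcoder(x)
print(y_hat.shape)  # (8, 16)
```

The skip term costs only one extra matrix, which is why it can lower reconstruction loss without adding features that would need interpreting.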

Source