Transcoders Beat Sparse Autoencoders for Interpretability
Gonçalo Paulo, Stepan Shabalin, Nora Belrose — 2025-01-31 — arXiv
Summary
Empirically compares transcoders with sparse autoencoders (SAEs) for neural network interpretability. Transcoders learn a sparse map from an MLP sublayer's input to its output, rather than reconstructing activations in place as SAEs do; the paper finds they produce more interpretable features, and proposes skip transcoders, which add a linear skip connection to improve reconstruction loss without sacrificing interpretability.
Key Result
Transcoder features are significantly more interpretable than SAE features when both are trained on the same model and data, and skip transcoders achieve lower reconstruction loss with no loss of interpretability.
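The skip-transcoder idea above can be sketched in a few lines: a sparse bottleneck predicts an MLP's output from its input, and a learned linear skip term is added to the decoder output. This is a minimal NumPy sketch, assuming a TopK-style sparse activation; all parameter names and dimensions are illustrative, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_hidden, k = 16, 64, 8  # illustrative sizes

# Hypothetical parameters: encoder, decoder, and the linear skip path
W_enc = rng.normal(size=(d_model, d_hidden)) / np.sqrt(d_model)
b_enc = np.zeros(d_hidden)
W_dec = rng.normal(size=(d_hidden, d_model)) / np.sqrt(d_hidden)
b_dec = np.zeros(d_model)
W_skip = rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)

def topk_relu(z, k):
    """Keep the k largest positive pre-activations, zero the rest."""
    pos = np.where(z > 0, z, 0.0)
    thresh = np.sort(pos)[-k]
    return np.where(pos >= thresh, pos, 0.0)

def skip_transcoder(x):
    """Predict the MLP's output from its input via a sparse bottleneck
    plus a learned linear skip term (the skip-transcoder idea)."""
    f = topk_relu(x @ W_enc + b_enc, k)      # sparse feature activations
    return f @ W_dec + b_dec + x @ W_skip    # decode + skip path

x = rng.normal(size=d_model)
y_hat = skip_transcoder(x)
```

In training, the parameters would be fit to minimize the squared error between `y_hat` and the MLP's actual output; the skip path carries the easily linear part of the map, freeing the sparse features to capture the rest.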
Source
- Link: https://arxiv.org/abs/2501.18823
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- sparse-coding — White-box safety (i.e. Interpretability)