Binary Sparse Coding for Interpretability

Lucia Quirke, Stepan Shabalin, Nora Belrose — 2025-09-29 — arXiv

Summary

Proposes binary sparse autoencoders (BAEs) and binary transcoders (BTCs), which constrain feature activations to 0 or 1. Binarization improves interpretability and monosemanticity but increases both reconstruction error and the number of ultra-high frequency features.
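To make the trade-off concrete, here is a toy sketch of a binary versus a continuous (ReLU) forward pass. It is an illustration under stated assumptions, not the paper's method: the hard threshold at zero, the hand-picked weights, and the function names are all hypothetical, and training (for example, how gradients pass through the threshold) is not covered.

```python
def pre_activations(x, W_enc, b_enc):
    """Linear encoder: one pre-activation per feature."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(W_enc, b_enc)]

def binary_encode(x, W_enc, b_enc):
    """Assumed binarization: a feature is 1 if its pre-activation is positive, else 0."""
    return [1 if p > 0 else 0 for p in pre_activations(x, W_enc, b_enc)]

def relu_encode(x, W_enc, b_enc):
    """Continuous baseline: keep the magnitude of positive pre-activations."""
    return [max(p, 0.0) for p in pre_activations(x, W_enc, b_enc)]

def decode(z, W_dec, b_dec):
    """Reconstruction: bias plus the activation-weighted sum of decoder vectors."""
    out = list(b_dec)
    for zi, vec in zip(z, W_dec):
        out = [o + zi * v for o, v in zip(out, vec)]
    return out

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) / len(a)

# Hand-picked toy weights (hypothetical): three features over a 2-d input.
W_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
b_enc = [0.0, 0.0, 0.0]
W_dec = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
b_dec = [0.0, 0.0]

x = [2.0, -1.0]
z_bin = binary_encode(x, W_enc, b_enc)    # [1, 0, 0]
z_relu = relu_encode(x, W_enc, b_enc)     # [2.0, 0.0, 0.0]
err_bin = mse(x, decode(z_bin, W_dec, b_dec))    # 1.0
err_relu = mse(x, decode(z_relu, W_dec, b_dec))  # 0.5
```

In this toy case both encoders fire the same single feature, but the binary code cannot express that the feature is present at strength 2, so its reconstruction error is higher. In exchange, a binary feature is either on or off, so there is no range of activation strengths to interpret.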

Key Result

Binarization significantly improves feature interpretability and monosemanticity by eliminating continuous variation in activation strength. It also raises reconstruction error and increases the number of uninterpretable ultra-high frequency features; once scores are adjusted for feature frequency, continuous sparse coders come out slightly ahead.

Source