Binary Sparse Coding for Interpretability
Lucia Quirke, Stepan Shabalin, Nora Belrose — 2025-09-29 — arXiv
Summary
Proposes binary sparse autoencoders (BAEs) and binary transcoders (BTCs), which constrain feature activations to 0 or 1. Binarization improves interpretability and monosemanticity, but it also increases reconstruction error and produces more ultra-high-frequency features.
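To make the idea of 0/1 feature activations concrete, here is a minimal NumPy sketch of a binarized encoder and linear decoder. It is an illustration under stated assumptions, not the paper's implementation: thresholding the pre-activation at zero, the random weights, the bias value, and all names (`binary_encode`, `decode`, `firing_frequency`) are hypothetical, and training (which would need some way to pass gradients through the hard threshold) is not shown.

```python
import numpy as np

def binary_encode(x, W_enc, b_enc):
    """Return 0/1 feature activations (assumed rule: 1 where pre-activation > 0)."""
    pre = x @ W_enc + b_enc
    return (pre > 0).astype(x.dtype)

def decode(z, W_dec, b_dec):
    """Linear reconstruction: each active feature adds its decoder row once."""
    return z @ W_dec + b_dec

def firing_frequency(z):
    """Fraction of inputs on which each feature is active."""
    return z.mean(axis=0)

# Toy data and untrained random weights, for illustration only.
rng = np.random.default_rng(0)
d_model, n_features, n_tokens = 8, 32, 100
x = rng.normal(size=(n_tokens, d_model))
W_enc = rng.normal(size=(d_model, n_features))
b_enc = np.full(n_features, -3.0)  # negative bias so fewer features fire
W_dec = rng.normal(size=(n_features, d_model))
b_dec = np.zeros(d_model)

z = binary_encode(x, W_enc, b_enc)       # shape (n_tokens, n_features), values in {0, 1}
x_hat = decode(z, W_dec, b_dec)          # shape (n_tokens, d_model)
mse = float(np.mean((x - x_hat) ** 2))   # reconstruction error
freq = firing_frequency(z)               # per-feature firing rate in [0, 1]
```

Because a binary feature cannot scale its contribution, the decoder has to represent magnitude by combining features, which is one intuitive reason reconstruction error can rise. A feature whose entry in `freq` is close to 1 fires on almost every input, which is the kind of ultra-high-frequency feature the summary refers to.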
Key Result
Binarization significantly improves feature interpretability and monosemanticity by eliminating continuous variation in activations, but it increases reconstruction error and yields more uninterpretable ultra-high-frequency features. Once scores are adjusted for feature frequency, continuous sparse coders come out slightly ahead.
Source
- Link: https://arxiv.org/abs/2509.25596
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- sparse-coding — White-box safety (i.e. Interpretability)