Overcoming Sparsity Artifacts in Crosscoders to Interpret Chat-Tuning
Julian Minder, Clément Dumas, Caden Juang, Bilal Chughtai, Neel Nanda — 2025-04-03 — NeurIPS 2025
Summary
Identifies and mitigates two artifacts in crosscoders (a sparse dictionary learning method for model diffing) that misattribute concepts to fine-tuned models; develops the Latent Scaling technique and a BatchTopK training loss to improve crosscoder methodology; and identifies interpretable chat-specific latents, including refusal-related features, in Gemma 2 2B.
Key Result
BatchTopK crosscoders substantially mitigate sparsity artifacts and successfully identify causally effective chat-specific latents representing concepts like false information, personal questions, and nuanced refusal triggers.
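The BatchTopK idea can be illustrated with a minimal sketch: instead of keeping the top-k latents per example (which forces exactly k active latents on every input), the top k × batch_size pre-activations are kept across the entire batch, so sparsity is enforced on average and easy examples can use fewer latents. This is an illustrative reconstruction, not the authors' code; the function name and use of NumPy are assumptions.

```python
import numpy as np

def batch_topk(preacts: np.ndarray, k: int) -> np.ndarray:
    """Sketch of BatchTopK sparsity (hypothetical helper, not the paper's code).

    preacts: (batch_size, dict_size) latent pre-activations.
    k: target average number of active latents per example.
    Keeps the k * batch_size largest pre-activations across the whole
    batch and zeroes everything else.
    """
    batch_size, _ = preacts.shape
    n_keep = k * batch_size
    flat = preacts.ravel()
    if n_keep >= flat.size:
        # Degenerate case: budget covers everything; fall back to ReLU.
        return np.maximum(preacts, 0.0)
    # Value of the n_keep-th largest entry acts as a global threshold.
    thresh = np.partition(flat, -n_keep)[-n_keep]
    return np.where(preacts >= thresh, preacts, 0.0)
```

Because the threshold is shared across the batch, individual examples may activate more or fewer than k latents, which is what mitigates the per-example sparsity artifacts attributed to vanilla L1-trained crosscoders.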
Source
- Link: https://arxiv.org/abs/2504.02922
- Listed in the Shallow Review of Technical AI Safety 2025 under two agendas:
- model-diffing — White-box safety (i.e. Interpretability)
- sparse-coding — White-box safety (i.e. Interpretability)