Overcoming Sparsity Artifacts in Crosscoders to Interpret Chat-Tuning

Julian Minder, Clément Dumas, Caden Juang, Bilal Chughtai, Neel Nanda — 2025-04-03 — NeurIPS 2025

Summary

Identifies and mitigates two sparsity artifacts in crosscoders (a sparse dictionary learning approach to model diffing) that misattribute concepts to fine-tuned models; develops the Latent Scaling diagnostic and a BatchTopK training loss to improve crosscoder methodology; and identifies interpretable chat-specific latents, including refusal-related features, in Gemma 2 2B.
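As I understand it, Latent Scaling reduces to a per-latent least-squares fit: for latent j with decoder direction d_j and activations f_j(x), find the scalar β that best reconstructs a target activation a(x) as β·f_j(x)·d_j; comparing the β fit against the base model's activations versus the chat model's flags latents wrongly attributed to one model. A minimal sketch of that closed-form step (function name, shapes, and variable names are my assumptions, not the paper's code):

```python
import numpy as np

def latent_scaling_beta(f_j: np.ndarray, d_j: np.ndarray, a: np.ndarray) -> float:
    """Closed-form least squares for the scalar beta minimizing
    || beta * f_j(x) * d_j - a(x) ||^2 summed over a batch.

    f_j: (batch,)     latent activations for latent j
    d_j: (dim,)       decoder direction of latent j
    a:   (batch, dim) target activations (e.g. base or chat residual stream)
    """
    pred = f_j[:, None] * d_j[None, :]   # (batch, dim) unscaled contribution
    num = float(np.sum(pred * a))        # inner product <pred, a>
    den = float(np.sum(pred * pred))     # squared norm ||pred||^2
    return num / den if den > 0 else 0.0
```

A latent whose fitted β is near zero against the base model's activations but large against the chat model's is a genuinely chat-specific candidate.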

Key Result

BatchTopK crosscoders substantially mitigate sparsity artifacts and successfully identify causally effective chat-specific latents representing concepts like false information, personal questions, and nuanced refusal triggers.
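The BatchTopK idea is to enforce sparsity at the batch level rather than per example: keep the batch_size × k largest pre-activations across the entire batch and zero the rest, so the number of active latents can vary per example. A minimal NumPy sketch (function name and tie handling are my choices, not the paper's implementation):

```python
import numpy as np

def batch_topk(preacts: np.ndarray, k: int) -> np.ndarray:
    """Keep the batch_size * k largest pre-activations across the whole
    batch, zeroing everything else. Unlike per-sample TopK, individual
    examples may use more or fewer than k latents."""
    batch_size = preacts.shape[0]
    n_keep = batch_size * k
    flat = preacts.ravel()
    if n_keep >= flat.size:
        return preacts.copy()
    # Threshold = value of the n_keep-th largest pre-activation in the batch
    threshold = np.partition(flat, -n_keep)[-n_keep]
    return np.where(preacts >= threshold, preacts, 0.0)
```

Example: with k=2 and a batch of two examples, four activations survive in total, but they need not split 2/2 across the examples.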

Source