Overcoming Sparsity Artifacts in Crosscoders to Interpret Chat-Tuning
Julian Minder, Clément Dumas, Caden Juang, Bilal Chughtai, Neel Nanda — 2025-04-03 — NeurIPS 2025
Summary
Identifies and mitigates two artifacts in crosscoders (a sparse dictionary learning method for model diffing) that misattribute concepts to fine-tuned models; develops the Latent Scaling technique and a BatchTopK training loss to improve crosscoder methodology; and identifies interpretable chat-specific latents, including refusal-related features, in Gemma 2 2B.
Key Result
BatchTopK crosscoders substantially mitigate sparsity artifacts and successfully identify causally effective chat-specific latents representing concepts like false information, personal questions, and nuanced refusal triggers.
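The BatchTopK idea can be illustrated with a minimal sketch: instead of keeping the top-k latents per example (which forces exactly k active latents on every input), the top k × batch_size pre-activations are kept across the entire batch, so sparsity is enforced on average and easy examples can use fewer latents. This is an illustrative reconstruction, not the authors' code; the function name and use of NumPy are assumptions.

```python
import numpy as np

def batch_topk(preacts: np.ndarray, k: int) -> np.ndarray:
    """Sketch of BatchTopK sparsity (hypothetical helper, not the paper's code).

    preacts: (batch_size, dict_size) latent pre-activations.
    k: target average number of active latents per example.
    Keeps the k * batch_size largest pre-activations across the whole
    batch and zeroes everything else.
    """
    batch_size, _ = preacts.shape
    n_keep = k * batch_size
    flat = preacts.ravel()
    if n_keep >= flat.size:
        # Degenerate case: budget covers everything; fall back to ReLU.
        return np.maximum(preacts, 0.0)
    # Value of the n_keep-th largest entry acts as a global threshold.
    thresh = np.partition(flat, -n_keep)[-n_keep]
    return np.where(preacts >= thresh, preacts, 0.0)
```

Because the threshold is shared across the batch, individual examples may activate more or fewer than k latents, which is what mitigates the per-example sparsity artifacts attributed to vanilla L1-trained crosscoders.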
Source
- Link: https://arxiv.org/abs/2504.02922
- Listed in the Shallow Review of Technical AI Safety 2025 under two agendas:
- model-diffing — White-box safety (i.e. Interpretability)
- sparse-coding — White-box safety (i.e. Interpretability)