Insights on Crosscoder Model Diffing
Siddharth Mishra-Sharma, Trenton Bricken, Jack Lindsey, Adam Jermyn, Jonathan Marcus, Kelley Rivoire, … (+2 more) — Anthropic — Transformer Circuits Thread
Summary
Investigates why model-exclusive features in crosscoder model diffing tend to be polysemantic. Toy models show that this arises from competition for feature capacity. The proposed mitigation designates a set of shared features and applies a reduced sparsity penalty to them, which successfully isolates interpretable model-exclusive features.
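To make the mitigation concrete, here is a minimal sketch of a crosscoder-style loss for a single input, in which features flagged as shared receive a smaller sparsity weight. It is an illustration under stated assumptions, not the authors' implementation: the loss form (squared reconstruction error for both models plus an L1 penalty weighted by decoder norms), the function and parameter names (`crosscoder_loss`, `shared_mask`, `lam`, `shared_ratio`), and the value `shared_ratio=0.1` are all hypothetical choices made for this example.

```python
import math

def l2_norm(vec):
    """Euclidean norm of a plain Python list."""
    return math.sqrt(sum(x * x for x in vec))

def decode(feature_acts, decoder):
    """Reconstruct an activation vector: sum_i f_i * decoder[i]."""
    d_model = len(decoder[0])
    return [
        sum(f * row[j] for f, row in zip(feature_acts, decoder))
        for j in range(d_model)
    ]

def crosscoder_loss(act_a, act_b, feature_acts, dec_a, dec_b,
                    shared_mask, lam=1.0, shared_ratio=0.1):
    """Reconstruction error for both models plus a per-feature sparsity penalty.

    Assumption: features with shared_mask[i] == True are the designated
    shared features and are penalized by lam * shared_ratio instead of lam.
    """
    recon_a = decode(feature_acts, dec_a)
    recon_b = decode(feature_acts, dec_b)
    recon_err = (
        sum((x - y) ** 2 for x, y in zip(act_a, recon_a))
        + sum((x - y) ** 2 for x, y in zip(act_b, recon_b))
    )
    sparsity = 0.0
    for f, row_a, row_b, is_shared in zip(feature_acts, dec_a, dec_b, shared_mask):
        weight = lam * shared_ratio if is_shared else lam
        sparsity += weight * f * (l2_norm(row_a) + l2_norm(row_b))
    return recon_err + sparsity

# Toy example: two features, two-dimensional activations, perfect reconstruction.
dec = [[1.0, 0.0], [0.0, 1.0]]
acts = [1.0, 1.0]
feats = [1.0, 1.0]

loss_uniform = crosscoder_loss(acts, acts, feats, dec, dec, [False, False])
loss_shared = crosscoder_loss(acts, acts, feats, dec, dec, [True, False])
print(loss_uniform, loss_shared)
```

In this toy case the reconstruction is exact, so the two losses differ only through the sparsity term: flagging the first feature as shared makes it cheaper to activate, which is the intended incentive for shared structure to be absorbed by the designated shared features rather than by the exclusive ones.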
Key Result
The modified crosscoder approach with designated shared features successfully isolated monosemantic exclusive features, with ~90% relating to expected sleeper agent behavior when diffing helpful-only vs sleeper agent models.
Source
- Link: https://transformer-circuits.pub/2025/crosscoder-diffing-update/index.html
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
  - model-diffing (White-box safety, i.e. interpretability)