Insights on Crosscoder Model Diffing

Siddharth Mishra-Sharma, Trenton Bricken, Jack Lindsey, Adam Jermyn, Jonathan Marcus, Kelley Rivoire, … (+2 more) — Anthropic — Transformer Circuits Thread

Summary

Investigates why exclusive features in crosscoder model diffing tend to be polysemantic. Toy models show that this emerges from competition for feature capacity. As a mitigation, the authors designate a subset of features as shared and give them a reduced sparsity penalty, which isolates interpretable model-exclusive features.
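
To make the mitigation concrete, here is a minimal sketch of a sparsity penalty in which designated shared features are penalised less than the rest. The summary states only that shared features receive a reduced sparsity penalty; the decoder-norm-weighted L1 form, the function and parameter names, the convention that the first `n_shared` features are the shared ones, and the 0.1 ratio are all illustrative assumptions rather than the authors' implementation.

```python
from math import sqrt


def l2_norm(vec):
    """Euclidean norm of a plain list of floats."""
    return sqrt(sum(x * x for x in vec))


def sparsity_penalty(acts, dec_a, dec_b, n_shared, lam=1.0, shared_scale=0.1):
    """Sparsity penalty for one input to a two-model crosscoder (hypothetical form).

    acts         -- non-negative feature activations, one per feature
    dec_a, dec_b -- per-feature decoder vectors for model A and model B
    n_shared     -- ASSUMPTION: the first n_shared features are the designated
                    shared ones
    lam          -- base sparsity coefficient
    shared_scale -- ASSUMPTION: factor (< 1) by which the penalty on shared
                    features is reduced; 0.1 is an arbitrary illustrative value
    """
    total = 0.0
    for i, f in enumerate(acts):
        # Shared features pay a smaller price for being active.
        weight = lam * shared_scale if i < n_shared else lam
        # Weight each activation by the summed decoder norms across both models.
        total += weight * f * (l2_norm(dec_a[i]) + l2_norm(dec_b[i]))
    return total


# Toy input: feature 0 is designated shared, feature 1 writes only to model A.
acts = [2.0, 1.0]
dec_a = [[3.0, 4.0], [1.0, 0.0]]
dec_b = [[3.0, 4.0], [0.0, 0.0]]

reduced = sparsity_penalty(acts, dec_a, dec_b, n_shared=1)
uniform = sparsity_penalty(acts, dec_a, dec_b, n_shared=1, shared_scale=1.0)
print(reduced, uniform)
```

Under these assumptions, representing a direction common to both models is cheap when it goes through a designated shared feature, which plausibly leaves the fully penalised features free to capture what is specific to one model.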

Key Result

When diffing a helpful-only model against a sleeper agent model, the modified crosscoder with designated shared features isolated monosemantic exclusive features, ~90% of which relate to the expected sleeper agent behavior.

Source