Insights on Crosscoder Model Diffing

Siddharth Mishra-Sharma, Trenton Bricken, Jack Lindsey, Adam Jermyn, Jonathan Marcus, Kelley Rivoire, … (+2 more) — Anthropic — Transformer Circuits Thread

Summary

Investigates why exclusive features in crosscoder model diffing tend to be polysemantic, uses toy models to show that this emerges from competition for feature capacity, and proposes a mitigation: designating a set of shared features that receive a reduced sparsity penalty, which isolates interpretable model-exclusive features.
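
As a rough illustration of the mitigation described in the summary, the sketch below computes a crosscoder-style sparsity penalty in which designated shared features are penalized less than the rest. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `sparsity_penalty`, the parameters `lam` and `shared_ratio`, the value 0.1, and the form of the penalty (activation times the sum of per-model decoder norms) are all illustrative choices that this summary does not specify.

```python
def sparsity_penalty(acts, dec_norms, shared, lam=1.0, shared_ratio=0.1):
    """Sparsity penalty for one input, with a discount on shared features.

    acts:       list of non-negative feature activations, one per feature.
    dec_norms:  list of lists; dec_norms[m][i] is the decoder-vector norm
                of feature i for model m.
    shared:     list of bools; True marks a designated shared feature.
    lam:        base sparsity coefficient (hypothetical default).
    shared_ratio: multiplier applied to lam for shared features
                (hypothetical default; the source only says "reduced").
    """
    total = 0.0
    for i, act in enumerate(acts):
        # Sum this feature's decoder norms across all models being diffed.
        norm_sum = sum(model_norms[i] for model_norms in dec_norms)
        # Shared features get a smaller coefficient than the others.
        coeff = lam * shared_ratio if shared[i] else lam
        total += coeff * act * norm_sum
    return total


# Two models, four features; the first two are designated as shared.
acts = [1.0, 1.0, 1.0, 1.0]
dec_norms = [[1.0, 1.0, 1.0, 1.0], [1.0, 1.0, 1.0, 1.0]]
shared = [True, True, False, False]

discounted = sparsity_penalty(acts, dec_norms, shared)
uniform = sparsity_penalty(acts, dec_norms, [False] * 4)
```

With the discount in place, the shared features are cheaper to activate than the others, so in this sketch `discounted` is smaller than `uniform`. The intent, as the summary describes it, is that shared structure is absorbed by the designated shared features, leaving the exclusive features free to capture genuinely model-specific behavior.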

Key Result

The modified crosscoder approach with designated shared features isolated monosemantic exclusive features: when diffing a helpful-only model against a sleeper agent model, ~90% of them related to the expected sleeper agent behavior.

Source

Link: https://transformer-circuits.pub/2025/crosscoder-diffing-update/index.html
Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- model-diffing — White-box safety (i.e. Interpretability)
