Sparse Coding — SR2025 Agenda Snapshot

One-sentence summary: Decompose an LLM’s polysemantic residual-stream activations into a sparse linear combination of monosemantic “features”, each corresponding to an interpretable concept.
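
As a minimal illustration of the idea (a sketch, not any particular lab’s implementation), a sparse autoencoder of this kind can be written in a few lines of PyTorch. The dimensions and the L1 coefficient below are placeholder assumptions.

```python
# Minimal sparse-autoencoder sketch. d_model, n_features and l1_coeff are
# illustrative placeholders, not values from any cited paper.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.W_enc = nn.Linear(d_model, n_features)  # activations -> feature coefficients
        self.W_dec = nn.Linear(n_features, d_model)  # features -> reconstructed activations

    def forward(self, x: torch.Tensor):
        f = torch.relu(self.W_enc(x))   # sparse, non-negative feature activations
        x_hat = self.W_dec(f)           # linear reconstruction of the residual-stream vector
        return x_hat, f

# Training objective: reconstruction error plus an L1 sparsity penalty on f.
sae = SparseAutoencoder(d_model=768, n_features=16384)
x = torch.randn(32, 768)                # stand-in batch of residual-stream activations
x_hat, f = sae(x)
l1_coeff = 1e-3
loss = ((x - x_hat) ** 2).mean() + l1_coeff * f.abs().mean()
loss.backward()
```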

Theory of Change

Get a principled decomposition of an LLM’s activations into atomic components → use those components to identify deception and other misbehaviors (see the monitoring sketch below).
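
One hypothetical way this could cash out: given a trained SAE and a feature that human analysis (e.g. via max-activating examples) has associated with deception, flag inputs on which that feature fires strongly. The feature index and threshold below are illustrative assumptions, not values from any real model.

```python
# Hypothetical monitoring sketch on top of a trained SAE (see sketch above).
import torch

DECEPTION_FEATURE = 1234   # assumed index of a human-labelled deception-related feature
THRESHOLD = 5.0            # assumed activation threshold for flagging

def flag_suspicious(activations: torch.Tensor, sae) -> torch.Tensor:
    """Return a boolean mask over the batch marking residual-stream vectors
    whose deception-associated feature activation exceeds the threshold."""
    _, features = sae(activations)                 # shape: (batch, n_features)
    return features[:, DECEPTION_FEATURE] > THRESHOLD
```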

Broad Approach

engineering / cognitive

Target Case

average

Orthodox Problems Addressed

Value is fragile and hard to specify, Goals misgeneralize out of distribution, Superintelligence can fool human supervisors

Key People

Leo Gao, Dan Mossing, Emmanuel Ameisen, Jack Lindsey, Adam Pearce, Thomas Heap, Abhinav Menon, Kenny Peng, Tim Lawson

Funding

Roughly everyone: frontier labs, LTFF, Coefficient Giving, etc.

Estimated FTEs: 50-100

Critiques

Sparse Autoencoders Can Interpret Randomly Initialized Transformers; The Sparse Autoencoders bubble has popped, but they are still promising; Negative Results for SAEs On Downstream Tasks and Deprioritising SAE Research; Sparse Autoencoders Trained on the Same Data Learn Different Features; Why Not Just Train For Interpretability?

See Also

Concept-based interpretability, reverse-engineering

Outputs in 2025

44 items in the review. See the wiki/summaries/ entries with frontmatter agenda: sparse-coding (generated alongside this file from the same export).

Source

Sources cited

Primary URLs harvested from this page’s summary references. Auto-generated by scripts/backfill_citations.py; edit by re-running, not by hand.