Sparse Coding — SR2025 Agenda Snapshot
One-sentence summary: Decompose the polysemantic activations of the residual stream into a sparse linear combination of monosemantic “features” that correspond to interpretable concepts.
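As a minimal sketch of that decomposition (assuming a PyTorch-style setup; the ReLU encoder, dimensions, and L1 coefficient are illustrative choices, not details from the review), a sparse autoencoder reconstructs a residual-stream activation as a sparse, non-negative combination of learned dictionary directions:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE: x ≈ decoder(f), where f = ReLU(encoder(x)) is sparse."""
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        # d_features is typically much larger than d_model (overcomplete dictionary)
        self.encoder = nn.Linear(d_model, d_features)  # activation -> feature coefficients
        self.decoder = nn.Linear(d_features, d_model)  # feature coefficients -> reconstruction

    def forward(self, x: torch.Tensor):
        f = torch.relu(self.encoder(x))  # non-negative, mostly-zero feature activations
        x_hat = self.decoder(f)          # linear combination of decoder columns ("features")
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty that induces sparsity in f.
    # l1_coeff is a hypothetical hyperparameter; real training sweeps it.
    return ((x - x_hat) ** 2).mean() + l1_coeff * f.abs().mean()
```

Common refinements omitted here include constraining decoder columns to unit norm and replacing the L1 penalty with a top-k activation rule.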
Theory of Change
Get a principled decomposition of an LLM’s activations into atomic components → identify deception and other misbehaviors.
Broad Approach
engineering / cognitive
Target Case
average
Orthodox Problems Addressed
Value is fragile and hard to specify, Goals misgeneralize out of distribution, Superintelligence can fool human supervisors
Key People
Leo Gao, Dan Mossing, Emmanuel Ameisen, Jack Lindsey, Adam Pearce, Thomas Heap, Abhinav Menon, Kenny Peng, Tim Lawson
Funding
Roughly everyone: frontier labs, LTFF, Coefficient Giving, etc.
Estimated FTEs: 50-100
Critiques
- Sparse Autoencoders Can Interpret Randomly Initialized Transformers
- The Sparse Autoencoders bubble has popped, but they are still promising
- Negative Results for SAEs On Downstream Tasks and Deprioritising SAE Research
- Sparse Autoencoders Trained on the Same Data Learn Different Features
- Why Not Just Train For Interpretability?
See Also
Concept-based interpretability, reverse-engineering
Outputs in 2025
44 items in the review. See the wiki/summaries/ entries with frontmatter agenda: sparse-coding (these were generated alongside this file from the same export).
Source
- Row in shallow-review-2025/agendas.csv (name = Sparse Coding) — Shallow Review of Technical AI Safety 2025.
Related Pages
- ai-safety
- reverse-engineering
- activation-engineering
- causal-abstractions
- data-attribution
- extracting-latent-knowledge
- human-inductive-biases
- learning-dynamics-and-developmental-interpretability
- lie-and-deception-detectors
- model-diffing
- monitoring-concepts
- other-interpretability
- pragmatic-interpretability
- representation-structure-and-geometry
Sources cited
Primary URLs harvested from this page’s summary references. Auto-generated by scripts/backfill_citations.py; edit by re-running, not by hand.
- Summary: AI Safety (Wikipedia) — referenced as [[ai-safety]]