Reverse Engineering (Mechanistic Interpretability)
What the agenda is
The reverse engineering agenda — broadly synonymous with mechanistic interpretability — aims to decompose trained neural networks into their functional, interacting components (circuits, features, attribution graphs), formally describe what computation those components perform, and validate the components’ causal effects on model behavior. The goal is a mechanistic understanding of how the model works — the analog of biology for trained transformers — sufficient to predict generalization, audit alignment, and surgically modify behavior.
The agenda was articulated in Olah et al. 2020, Zoom In: An Introduction to Circuits and extended to transformers in Elhage et al. 2021, A Mathematical Framework for Transformer Circuits.
Lead orgs & people
- Anthropic — Chris Olah’s interpretability team. By far the largest and most-published group.
- Google DeepMind — Neel Nanda’s mech-interp team, Stefan Heimersheim, Leo Gao.
- OpenAI — interpretability research dispersed across teams.
- EleutherAI — Lucius Bushnaq, Dan Braun, Lee Sharkey (academic + open-research interpretability).
- Apollo Research — adjacent interpretability + deception-detection work.
- Academic: Aaron Mueller, Atticus Geiger, Sheridan Feucht, David Bau (NEU), Yonatan Belinkov.
- Estimated 100–200 FTEs across the named orgs. The largest interpretability sub-agenda by FTE count.
Current state (2026)
- Sparse autoencoders scale feature-finding to frontier models. Templeton et al. 2024, Scaling Monosemanticity trained SAEs on Claude 3 Sonnet activations and identified millions of monosemantic features — including safety-relevant ones (deception, sycophancy, code vulnerabilities). First demonstration that feature-level interpretability scales to a deployed frontier model (a minimal SAE sketch appears after this list).
- Attribution graphs are the current frontier. Lindsey et al. 2025, On the Biology of a Large Language Model introduces attribution-graph methodology for tracing how features compose into behavior, and sets the current methodological state of the art.
- Practical applications are emerging. Model editing (the Eiffel Tower study), feature-based steering, and deception detection on stylized tasks all show interpretability tools delivering safety-relevant insight (Atlas Ch.7 — Detection; see the steering sketch after this list). Sleeper Agents motivated interpretability-based detection by demonstrating that behavioral methods cannot remove installed misalignment (Hubinger et al. 2024).
- Skeptical views are now well-articulated. A Pragmatic Vision for Interpretability (DeepMind) argues for narrower-scope interpretability targets. Activation space interpretability may be doomed raises concerns about whether activation-space methods can scale at all. The Misguided Quest for Mechanistic AI Interpretability is the strongest external critique. The agenda’s response has been iterative methodological refinement.
- Adversarial-robustness concerns. Interpretability Will Not Reliably Find Deceptive AI and A Problem to Solve Before Building a Deception Detector argue that the agenda’s strongest claim — detecting strategic deception — may not survive an adversarial schemer. The methodology shows promise on stylized tasks but remains unvalidated against real adversaries.
- ~33 papers tagged to the agenda in Shallow Review 2025. Among the largest output counts of any SR2025 agenda. The methodology is rapidly evolving and increasingly modular (SAEs, attribution graphs, causal abstractions, learned probes).
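The SAE approach in Templeton et al. 2024 is conceptually simple even if scaling it is not: train an overcomplete autoencoder with a sparsity penalty on a model’s activations so that each latent dimension ideally fires for one human-interpretable feature. Below is a minimal sketch of that idea; the dimensions, expansion factor, and L1 coefficient are illustrative assumptions, not the published hyperparameters, and the random tensors stand in for cached residual-stream activations.

```python
# Minimal sparse-autoencoder sketch for activation interpretability.
# d_model, n_features, and l1_coef are illustrative placeholders, not the
# values used in Scaling Monosemanticity.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)   # activation -> feature coefficients
        self.decoder = nn.Linear(n_features, d_model)   # feature coefficients -> reconstruction
        self.relu = nn.ReLU()

    def forward(self, acts: torch.Tensor):
        feats = self.relu(self.encoder(acts))            # non-negative, ideally sparse
        recon = self.decoder(feats)
        return recon, feats

def train_step(sae, acts, opt, l1_coef=1e-3):
    recon, feats = sae(acts)
    recon_loss = (recon - acts).pow(2).mean()             # reconstruct the activation
    sparsity_loss = feats.abs().mean()                    # L1 pushes toward monosemantic features
    loss = recon_loss + l1_coef * sparsity_loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Toy usage: random vectors stand in for residual-stream activations
# collected from a real model.
d_model, n_features = 512, 4096                            # 8x overcomplete dictionary
sae = SparseAutoencoder(d_model, n_features)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
for _ in range(100):
    batch = torch.randn(256, d_model)
    train_step(sae, batch, opt)
```

After training, individual decoder columns are candidate feature directions; interpreting them requires inspecting the inputs on which each latent activates.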
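Feature-based steering, as referenced above, typically amounts to adding a scaled feature direction (for example, one SAE decoder column) into a hidden state at inference time. The sketch below shows one common way to do this with a PyTorch forward hook; the layer path, scale, and source of the direction are assumptions for illustration, not a documented recipe from any of the cited papers.

```python
# Illustrative steering sketch: add a feature direction to a hidden state
# via a forward hook. Layer index, scale, and the direction's origin are
# assumptions for the example.
import torch

def make_steering_hook(direction: torch.Tensor, scale: float):
    direction = direction / direction.norm()
    def hook(module, inputs, output):
        # Many transformer blocks return a tuple whose first element is the
        # hidden state; handle both tuple and tensor outputs.
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + scale * direction.to(hidden.dtype)
        return (steered, *output[1:]) if isinstance(output, tuple) else steered
    return hook

# Usage sketch (assumes `model` is a decoder-only transformer and
# `feature_direction` is a d_model-sized vector, e.g. an SAE decoder column):
# handle = model.transformer.h[10].register_forward_hook(
#     make_steering_hook(feature_direction, scale=4.0))
# ... run generation ...
# handle.remove()
```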
Recent papers
- Olah et al. 2020 — Zoom In: An Introduction to Circuits (foundational).
- Elhage et al. 2021 — A Mathematical Framework for Transformer Circuits.
- Templeton et al. 2024 — Scaling Monosemanticity.
- Lindsey et al. 2025 — On the Biology of a Large Language Model.
- Hubinger et al. 2024 — Sleeper Agents (motivation for interpretability-based detection).
- Mechanistic? — methodological critique.
- Anthropic 2025 — Open-source circuit tracing tools.
Historical foundations
- The Distill circuits thread (2017–2020): Olah’s pre-Anthropic series at Google Brain and Distill on visual-network features and circuits established the methodology — features → circuits → algorithms → behavior. Olah et al. 2020 is the consolidating paper.
- Pre-mech-interp interpretability: linear probes (Alain & Bengio ~2016), saliency maps, attention visualization. These remain useful but are not mechanistic in the modern sense — they show correlates, not the underlying algorithm (a minimal probe sketch appears after this list).
- The transformer-specific shift (Elhage et al. 2021; the induction-heads paper) made circuits methodology applicable to LLMs and underlies essentially all subsequent transformer interpretability.
- The Sleeper Agents motivation (Hubinger et al. 2024): direct empirical evidence that behavioral methods cannot remove installed misalignment — making interpretability-based detection the only remaining lever for some failure modes. This shifted resources toward the agenda dramatically in 2024–25.
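Linear probes in the Alain & Bengio sense are just classifiers trained on frozen activations: high probe accuracy shows a concept is linearly decodable from a layer, not how the model computes it. A minimal sketch follows; the activations and labels are synthetic placeholders standing in for a cached forward pass over labeled prompts.

```python
# Minimal linear-probe sketch (Alain & Bengio style): logistic regression on
# frozen activations. Data here is synthetic; in practice, activations come
# from a cached forward pass and labels from annotated prompts.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
acts = rng.normal(size=(2000, 512))                       # stand-in for layer-k activations
labels = (acts[:, 0] + 0.1 * rng.normal(size=2000) > 0).astype(int)  # toy concept

X_train, X_test, y_train, y_test = train_test_split(
    acts, labels, test_size=0.25, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("probe accuracy:", probe.score(X_test, y_test))
# High accuracy is evidence of a linearly readable correlate, not evidence
# about the circuit that computes the concept.
```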
Open problems
- Does the methodology scale further? Mechanistic interpretability has worked on small transformers and now on Claude 3 Sonnet. Whether it continues to work at GPT-5+ or Claude 5 scale, where features may be more polysemantic and circuits more entangled, is the central empirical bet (Templeton et al. 2024 §6).
- Can interpretability detect a strategic adversary? Linear probes and SAE-based detection report 95–99% accuracy on stylized deception tasks, but performance against a model that optimizes against the probe is largely unestablished (Interpretability Will Not Reliably Find Deceptive AI).
- What’s the right unit of interpretation? Features (Olah’s circuits), attention heads (Elhage), SAE features (Templeton), attribution graphs (Lindsey) — the methodology has shifted multiple times. Whether the field converges on a stable framework, or the unit of analysis keeps shifting, is open.
- Does training pressure on interpreted features destroy interpretability? The “most forbidden technique” concern: training against a detection signal makes the signal disappear. CoT obfuscation under training is an established analog (Baker et al. 2025); whether SAE features are similarly fragile under training is open.
- Faithfulness gap. Even when interpretability tools identify a feature or circuit, establishing that the feature causally drives behavior (rather than being a correlate) is non-trivial. Interventional evidence (ablation, activation patching) is preferred but expensive; a schematic patching sketch appears after this list.
- Activation-space interpretability skepticism. Activation space interpretability may be doomed argues that activation-space methods are studying the wrong object — that real “interpretation” requires moving beyond activations to weights and computational structure. The agenda’s response varies by sub-team.
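The standard interventional check behind the faithfulness gap is activation patching: run the model on a clean and a corrupted prompt, copy a candidate component’s activation from the clean run into the corrupted run, and measure how much of the clean behavior is restored. The sketch below is schematic; the layer path, prompts, and success metric are assumptions, and it presumes the clean and corrupted prompts have matching lengths.

```python
# Schematic activation-patching sketch. `model` is any PyTorch transformer;
# the layer path and inputs in the commented usage are illustrative.
import torch

def capture_activation(model, layer, inputs):
    """Run the model and cache the chosen layer's output on these inputs."""
    store = {}
    def hook(module, ins, out):
        store["act"] = out[0] if isinstance(out, tuple) else out
    handle = layer.register_forward_hook(hook)
    with torch.no_grad():
        model(**inputs)
    handle.remove()
    return store["act"]

def run_with_patch(model, layer, inputs, patched_act):
    """Run the model with the layer's output replaced by a cached activation.
    Assumes the cached activation matches this input's sequence length."""
    def hook(module, ins, out):
        if isinstance(out, tuple):
            return (patched_act, *out[1:])
        return patched_act
    handle = layer.register_forward_hook(hook)
    with torch.no_grad():
        out = model(**inputs)
    handle.remove()
    return out

# Sketch of the experiment:
# clean_act = capture_activation(model, model.transformer.h[8], clean_inputs)
# patched_logits = run_with_patch(model, model.transformer.h[8], corrupt_inputs, clean_act)
# If patching restores the clean answer's logit, the component is causally
# implicated in the behavior rather than being a mere correlate.
```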
Related Pages
- reverse-engineering (this agenda)
- interpretability
- mechanistic-interpretability
- deceptive-alignment
- scheming
- ai-control
- model-organisms-of-misalignment
- chain-of-thought-monitoring
- lie-and-deception-detectors
- sparse-coding
- representation-structure-and-geometry
- causal-abstractions
- model-diffing
- learning-dynamics-and-developmental-interpretability
- heuristic-explanations
- other-interpretability
- activation-engineering
- data-attribution
- extracting-latent-knowledge
- human-inductive-biases
- monitoring-concepts
- pragmatic-interpretability
- anthropic
- deepmind
- openai
- ai-safety
- ai-safety-atlas-textbook
Sources cited
Primary URLs harvested from this page’s summary references. Auto-generated by scripts/backfill_citations.py; edit by re-running, not by hand.
- Summary: AI Safety (Wikipedia) — referenced as [[ai-safety]]