Reverse Engineering (Mechanistic Interpretability)

What the agenda is

The reverse engineering agenda — broadly synonymous with mechanistic interpretability — aims to decompose trained neural networks into their functional, interacting components (circuits, features, attribution graphs), formally describe what computation those components perform, and validate the components’ causal effects on model behavior. The goal is a mechanistic understanding of how the model works — the analog of biology for trained transformers — sufficient to predict generalization, audit alignment, and surgically modify behavior.

The agenda was articulated in Olah et al. 2020, Zoom In: An Introduction to Circuits and extended to transformers in Elhage et al. 2021, A Mathematical Framework for Transformer Circuits.

Lead orgs & people

  • Anthropic — Chris Olah’s interpretability team. By far the largest and most-published group.
  • Google DeepMind — Neel Nanda’s mech-interp team.
  • OpenAI — interpretability research dispersed across teams (including Leo Gao’s sparse-autoencoder scaling work).
  • Apollo Research — Lee Sharkey, Lucius Bushnaq, Dan Braun, Stefan Heimersheim; interpretability plus adjacent deception-detection work.
  • EleutherAI — open-research and academic-adjacent interpretability.
  • Academic: Aaron Mueller, Atticus Geiger, Sheridan Feucht, David Bau (Northeastern), Yonatan Belinkov.
  • Estimated 100–200 FTEs across the named orgs. The largest interpretability sub-agenda by FTE count.

Current state (2026)


Historical foundations

  • The Distill circuits thread (2017–2020): Olah’s pre-Anthropic series, begun at Google Brain and continued at OpenAI, published on Distill, on visual-network features and circuits. It established the methodology — features → circuits → algorithms → behavior. Olah et al. 2020 is the consolidating paper.

  • Pre-mech-interp interpretability: linear probes (Alain & Bengio ~2016), saliency maps, attention visualization. These remain useful but are not mechanistic in the modern sense — they show correlates, not the underlying algorithm.

  • The transformer-specific shift (Elhage et al. 2021; the induction-heads paper). This made the circuits methodology applicable to LLMs and underlies essentially all subsequent transformer interpretability.

  • The Sleeper Agents motivation (Hubinger et al. 2024). Direct empirical evidence that standard behavioral training methods can fail to remove deliberately installed misalignment — making interpretability-based detection the only remaining lever for some failure modes. This shifted resources toward the agenda dramatically in 2024–25.
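The linear probes mentioned above are the simplest tool in this lineage: a linear classifier trained on frozen activations, which tests whether a concept is linearly decodable without saying anything about the mechanism that computes it. A minimal sketch, using synthetic activations with one planted concept direction (all data and dimensions here are illustrative, not from any real model):

```python
import numpy as np

# Minimal linear probe (in the Alain & Bengio 2016 sense): logistic
# regression on frozen "activations". Synthetic stand-in data: 64-d
# activation vectors whose class means differ along one planted direction.
rng = np.random.default_rng(0)
d, n = 64, 500
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
labels = rng.integers(0, 2, size=n)                       # binary concept label
acts = rng.normal(size=(n, d)) + np.outer(2.0 * labels - 1.0, direction)

# Train probe weights w by gradient descent on the logistic loss.
w, b = np.zeros(d), 0.0
for _ in range(200):
    p = 1.0 / (1.0 + np.exp(-(acts @ w + b)))             # predicted P(label = 1)
    w -= 0.5 * (acts.T @ (p - labels) / n)
    b -= 0.5 * np.mean(p - labels)

acc = np.mean((acts @ w + b > 0) == labels)
cosine = abs(w @ direction) / np.linalg.norm(w)
print(f"probe accuracy: {acc:.2f}")
print(f"probe alignment with planted direction: {cosine:.2f}")
```

High probe accuracy shows the concept is linearly readable from the activations — a correlate, in the document’s terms, not yet a mechanism.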

Open problems

  • Does the methodology scale further? Mechanistic interpretability has worked at small transformers and now at Claude 3 Sonnet. Whether it continues to work at GPT-5+ or Claude 5 scale, where features may be more polysemantic and circuits more entangled, is the central empirical bet (Templeton et al. 2024 §6).

  • Can interpretability detect a strategic adversary? Linear probes and SAE-based detectors report 95–99% accuracy on stylized deception tasks, but performance against a model that optimizes against the probe is largely unestablished (Interpretability Will Not Reliably Find Deceptive AI).

  • What’s the right unit of interpretation? Features (Olah’s circuits), attention heads (Elhage), SAEs (Templeton), attribution graphs (Lindsey) — methodology has shifted multiple times. Whether convergence on a stable framework happens, or whether the unit of analysis keeps shifting, is open.

  • Does training pressure on interpreted features destroy interpretability? The “most forbidden technique” concern: training on a detection signal makes the signal disappear. CoT obfuscation under training is an established analog (Baker et al. 2025); whether SAE features are similarly fragile under training is open.

  • Faithfulness gap. Even when interpretability tools identify a feature or circuit, whether that feature causally drives behavior (vs. being a correlate) is non-trivial. Interventional evidence (ablation, patching) is preferred but expensive.

  • Activation-space interpretability skepticism. Activation space interpretability may be doomed argues that activation-space methods are studying the wrong object — that real “interpretation” requires moving beyond activations to weights and computational structure. The agenda’s response varies by sub-team.
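The interventional evidence the faithfulness point calls for (ablation, patching) has a simple core: overwrite an internal activation with its value from a different run and measure the effect on the output. A toy sketch of single-unit activation patching on a synthetic two-layer network (the network, inputs, and dimensions are invented for illustration):

```python
import numpy as np

# Toy activation patching: swap one hidden unit's activation from a
# "clean" run into a "corrupted" run, and score each unit by how much
# the patch moves the output. This is the causal test that separates
# a mechanism from a mere correlate.
rng = np.random.default_rng(1)
d_in, d_hid = 8, 16
W1 = rng.normal(size=(d_in, d_hid))
W2 = rng.normal(size=d_hid)

def forward(x, patch=None):
    h = np.maximum(x @ W1, 0.0)          # ReLU hidden layer
    if patch is not None:                # overwrite one unit's activation
        idx, value = patch
        h = h.copy()
        h[idx] = value
    return h @ W2                        # scalar output

x_clean = rng.normal(size=d_in)
x_corrupt = rng.normal(size=d_in)
h_clean = np.maximum(x_clean @ W1, 0.0)  # cached clean activations
y_corrupt = forward(x_corrupt)

# Effect of unit i = change in output when its clean activation is
# patched into the corrupted run.
effects = np.array([
    forward(x_corrupt, patch=(i, h_clean[i])) - y_corrupt
    for i in range(d_hid)
])
top = int(np.argmax(np.abs(effects)))
print(f"most causally influential unit: {top} (effect {effects[top]:+.3f})")
```

Real patching experiments work the same way but intervene on residual-stream or head activations via forward hooks, which is what makes them far more expensive than probing.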
