Interpretability

Definition

Interpretability is the field that aims to understand what trained neural networks are doing internally — what features they represent, what algorithms their parameters implement, and how those map onto the model’s external behavior. The most ambitious sub-area is mechanistic interpretability, which aims to reverse-engineer the network at the level of features, circuits, and attribution graphs (Olah et al. 2020, Zoom In: An Introduction to Circuits; Elhage et al. 2021, A Mathematical Framework for Transformer Circuits).

For the safety-relevant sub-area focused specifically on transformer circuits and features, see mechanistic-interpretability (a partially-overlapping page that focuses on the narrower technical agenda).

Why it matters

Interpretability is the only known approach that could, in principle, verify alignment without relying solely on behavioral evidence — and behavioral evidence is exactly what deceptive-alignment / scheming is designed to defeat. If interpretability works, you can look inside a model and check whether its internal goals match its external behavior. If it doesn’t, we are left with behavioral evaluation, which has a known ceiling (Atlas Ch.7 — Detection; see atlas-ch7-goal-misgeneralization-05-detection).

This is why interpretability research is one of the largest investments at frontier safety labs: Anthropic’s interpretability team is among its largest research groups, and most of the recent breakthroughs in feature-finding came out of it (Templeton et al. 2024, Scaling Monosemanticity; Lindsey et al. 2025, On the Biology of a Large Language Model).

Three concrete safety applications:

  • Alignment verification. If we can read the internal goals, we can check alignment (Atlas Ch.7).
  • Deception detection. Behavioral testing fails against a strategically-deceptive model; interpretability is the proposed escape hatch (Hubinger et al. 2024, Sleeper Agents; see atlas-ch7-goal-misgeneralization-05-detection).
  • Misuse detection. Identifying internal patterns associated with bioweapons design or cyber-attack planning, before the output is generated.
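The deception-detection application above is typically prototyped with linear probes on a model's internal activations. A minimal sketch on synthetic data (the activation dimension, the planted "deception direction", and the Gaussian activations are all invented here for illustration; real probes are trained on activations extracted from a live model):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64                                   # assumed activation dimension
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)   # planted "deception direction"

def sample_activations(n, deceptive):
    """Synthetic residual-stream activations shifted along the planted direction."""
    base = rng.normal(size=(n, d))
    return base + (2.0 if deceptive else -2.0) * direction

X = np.vstack([sample_activations(500, True), sample_activations(500, False)])
y = np.concatenate([np.ones(500), np.zeros(500)])

# Logistic-regression probe trained with plain gradient descent.
w, b = np.zeros(d), 0.0
for _ in range(200):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # sigmoid
    w -= 0.5 * (X.T @ (p - y) / len(y))
    b -= 0.5 * np.mean(p - y)

preds = (X @ w + b) > 0
accuracy = np.mean(preds == y)
print(f"probe accuracy: {accuracy:.2f}")
```

The learned weight vector `w` ends up aligned with the planted direction — which is exactly the failure mode the open questions below flag: a probe recovers whatever direction separates the training labels, whether or not that direction is what a strategic adversary would use.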

Key results

  • The Circuits research program (Olah et al. 2020). Original Distill series introducing the methodology of reverse-engineering vision networks at the level of individual neurons and the circuits connecting them. Established that meaningful, human-understandable features (curve detectors in the original work; the multimodal “Spider-Man” neuron in the follow-up, Goh et al. 2021, Multimodal Neurons in Artificial Neural Networks) can be identified inside trained networks.

  • Mathematical framework for transformer circuits (Elhage et al. 2021). Generalized the circuits methodology to transformers and identified mechanisms (induction heads) that explain in-context learning. The framework underlies almost all subsequent transformer interpretability work.

  • Sparse autoencoders scale feature-finding to frontier models (Templeton et al. 2024, Scaling Monosemanticity). Anthropic trained sparse autoencoders on Claude 3 Sonnet activations and identified millions of monosemantic features — including safety-relevant ones (deception, sycophancy, code vulnerabilities). First demonstration that feature-level interpretability scales to a deployed frontier model.

  • Attribution-graph methodology (Lindsey et al. 2025, On the Biology of a Large Language Model). Anthropic’s most recent methodology for tracing how features compose into behavior — the analog of biology (anatomy + physiology) for trained transformers. Sets the current methodological frontier.

  • Sleeper Agents motivate interpretability-based detection (Hubinger et al. 2024). Backdoors trained into models persist through supervised fine-tuning, RL, and adversarial training — direct evidence that behavioral methods cannot remove installed misalignment, leaving interpretability-based detection as the primary remaining lever.

  • Practical applications already exist. Pre-mech-interp interpretability already produced concrete results: medical imaging classifiers attending to hospital labels rather than disease features; model editing of factual associations (the Eiffel Tower counterfactual in ROME; Meng et al. 2022, Locating and Editing Factual Associations in GPT). These show interpretability tools delivering safety-relevant insight, not just academic curiosity (Atlas Ch.7).
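The sparse-autoencoder approach behind Scaling Monosemanticity can be illustrated in miniature. This is a toy sketch, not Anthropic’s implementation: the dimensions, L1 coefficient, learning rate, and synthetic Gaussian “activations” are all assumptions. The core idea survives the simplification — map d-dimensional activations into an overcomplete m-dimensional feature space through a ReLU, reconstruct them, and penalize feature activations with L1 so that most features stay at zero:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 32, 128            # activation dim and overcomplete feature count (assumed)
l1_coeff = 1e-3
lr = 0.05

W_enc = rng.normal(scale=0.1, size=(d, m))
b_enc = np.zeros(m)
W_dec = rng.normal(scale=0.1, size=(m, d))

def sae_loss(x):
    f = np.maximum(x @ W_enc + b_enc, 0.0)   # sparse feature activations
    x_hat = f @ W_dec                        # reconstruction
    recon = np.mean((x - x_hat) ** 2)
    sparsity = l1_coeff * np.mean(np.abs(f))
    return recon + sparsity

# Train with hand-written gradient descent on synthetic activations.
X = rng.normal(size=(1024, d))
losses = []
for _ in range(300):
    f = np.maximum(X @ W_enc + b_enc, 0.0)
    x_hat = f @ W_dec
    err = x_hat - X
    # Backprop through decoder, L1 term, ReLU, and encoder by hand.
    grad_W_dec = f.T @ err * (2 / X.size)
    grad_f = err @ W_dec.T * (2 / X.size) + l1_coeff * np.sign(f) / f.size
    grad_f *= (f > 0)                        # ReLU gradient gate
    W_dec -= lr * grad_W_dec
    W_enc -= lr * (X.T @ grad_f)
    b_enc -= lr * grad_f.sum(axis=0)
    losses.append(sae_loss(X))

print(f"loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

In the real setting the rows of `W_dec` are the candidate monosemantic features, and interpretation proceeds by inspecting which inputs maximally activate each one.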

Open questions

  • Does interpretability scale with model size? Mechanistic interpretability has worked well on small transformers and on Claude 3 Sonnet. Whether it continues to work at GPT-5+ or Claude 5 scale, where features may be more polysemantic and circuits more entangled, is the central empirical bet (Templeton et al. 2024 §6).

  • Can interpretability detect a strategic adversary? Linear probes and SAE-based detectors report 95–99% accuracy on stylized deception tasks, but performance against a model that optimizes against the probe is largely unestablished (Atlas Ch.7).

  • What’s the right unit of interpretation? Features (Olah’s circuits), attention heads (Elhage), SAEs (Templeton), attribution graphs (Lindsey) — methodology has shifted multiple times in five years. Whether convergence on a stable methodological framework happens, or whether the unit of analysis keeps shifting, is open.

  • Does training pressure on interpreted features destroy interpretability? Optimizing directly against chain-of-thought monitors causes models to obfuscate their reasoning; the analogous question for SAE features is open. If the answer is yes, interpretability becomes a read-only tool (“the most forbidden technique”).

  • Faithfulness gap. Even when interpretability tools identify a feature or circuit, the question of whether that feature causally drives behavior (vs. being a correlate) is non-trivial. Interventional evidence is preferred but expensive (Atlas Ch.7).
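The interventional evidence the faithfulness question asks for is often gathered via activation patching: cache an internal activation from a "clean" run, splice it into a "corrupted" run, and see how much of the clean behavior returns. A toy sketch with a hand-built network (all weights and inputs are invented; by construction only hidden unit 0 carries the causal signal, standing in for a real model component under test):

```python
import numpy as np

def forward(x, W1, W2, patch=None):
    """Two-layer ReLU network; optionally overwrite one hidden unit's activation."""
    h = np.maximum(W1 @ x, 0.0)
    if patch is not None:
        idx, value = patch
        h = h.copy()
        h[idx] = value                   # splice in the cached clean activation
    return W2 @ h

rng = np.random.default_rng(0)
W1 = np.zeros((4, 2))
W1[0] = [1.0, 0.0]                       # unit 0 reads input feature 0 (the signal)
W1[1:] = rng.normal(size=(3, 2))         # distractor units, active but causally inert
W2 = np.array([1.0, 0.0, 0.0, 0.0])     # output depends only on unit 0

x_clean = np.array([1.0, 0.3])
x_corrupt = np.array([0.0, 0.3])         # same input with the signal removed

h_clean = np.maximum(W1 @ x_clean, 0.0)
out_corrupt = forward(x_corrupt, W1, W2)

# Patch each hidden unit in turn and measure how much clean output returns.
effects = [forward(x_corrupt, W1, W2, patch=(i, h_clean[i])) - out_corrupt
           for i in range(4)]
print("per-unit patching effects:", np.round(effects, 3))
```

Only unit 0 shows a nonzero patching effect, even though the distractor units also fire — the distinction between "activates on the input" (a correlate) and "moves the output when intervened on" (a cause) is exactly the faithfulness gap.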

Related pages

  • mechanistic-interpretability — the narrower technical sub-field focused on circuits and features.
  • ai-alignment — the parent problem interpretability serves.
  • deceptive-alignment — the failure mode that makes interpretability load-bearing.
  • scheming — strategic deception that interpretability is the proposed counter to.
  • scalable-oversight — complementary: oversight provides the institutional framework, interpretability provides the verification tools.
  • capability-evaluations — interpretability could detect sandbagged capability when behavior alone can’t.
  • ai-control — interpretability could weaken the assumptions control protocols rely on.
  • model-organisms-of-misalignment — provides the ground truth for evaluating interpretability methods.
