Interpretability
Definition
Interpretability is the field that aims to understand what trained neural networks are doing internally — what features they represent, what algorithms their parameters implement, and how those map onto the model’s external behavior. The most ambitious sub-area is mechanistic interpretability, which aims to reverse-engineer the network at the level of features, circuits, and attribution graphs (Olah et al. 2020, Zoom In: An Introduction to Circuits; Elhage et al. 2021, A Mathematical Framework for Transformer Circuits).
For the safety-relevant sub-area focused specifically on transformer circuits and features, see mechanistic-interpretability (a partially overlapping page that covers the narrower technical agenda).
Why it matters
Interpretability is the only known approach that could in principle verify alignment without relying solely on behavioral evidence, and behavioral evidence is exactly what deceptive-alignment / scheming is built to defeat. If interpretability works, you can look inside a model and check whether its internal goals match its external behavior. If it doesn't work, behavioral evaluation has a known ceiling (Atlas Ch.7 — Detection; see atlas-ch7-goal-misgeneralization-05-detection).
This is why interpretability research is one of the largest investments at frontier safety labs: Anthropic’s interpretability team is among its largest research groups, and most of the recent breakthroughs in feature-finding came out of it (Templeton et al. 2024, Scaling Monosemanticity; Lindsey et al. 2025, On the Biology of a Large Language Model).
Three concrete safety applications:
- Alignment verification. If we can read the internal goals, we can check alignment (Atlas Ch.7).
- Deception detection. Behavioral testing fails against a strategically deceptive model; interpretability is the proposed escape hatch (Hubinger et al. 2024, Sleeper Agents; see atlas-ch7-goal-misgeneralization-05-detection).
- Misuse detection. Identifying internal patterns associated with bioweapons design or cyber-attack planning, before the output is generated.
Key results
- The Circuits research program (Olah et al. 2020). The original Distill series introduced the methodology of reverse-engineering vision networks at the level of individual neurons and the circuits connecting them, and established that meaningful, human-understandable features (curve detectors, the multimodal "Spider-Man" neuron) can be identified inside trained networks.
- Mathematical framework for transformer circuits (Elhage et al. 2021). Generalized the circuits methodology to transformers and identified the mechanism (induction heads) that explains much of in-context learning; a diagnostic sketch for induction heads appears after this list. The framework underlies almost all subsequent transformer interpretability work.
- Sparse autoencoders scale feature-finding to frontier models (Templeton et al. 2024, Scaling Monosemanticity). Anthropic trained sparse autoencoders on Claude 3 Sonnet activations and identified millions of monosemantic features, including safety-relevant ones (deception, sycophancy, code vulnerabilities). This was the first demonstration that feature-level interpretability scales to a deployed frontier model; a minimal SAE sketch also follows this list.
- Attribution-graph methodology (Lindsey et al. 2025, On the Biology of a Large Language Model). Anthropic's most recent methodology for tracing how features compose into behavior, the analog of biology (anatomy plus physiology) for trained transformers. It sets the current methodological frontier.
- Sleeper Agents motivate interpretability-based detection (Hubinger et al. 2024). Backdoors trained into models persist through supervised fine-tuning, RL, and adversarial training: direct evidence that behavioral methods cannot remove installed misalignment, leaving interpretability-based detection as the primary remaining lever.
- Practical applications already exist. Pre-mechanistic-interpretability work already produced concrete results: medical imaging classifiers shown to attend to hospital labels rather than disease features, and model editing of factual associations (the Eiffel Tower study). These show interpretability tools delivering safety-relevant insight, not just academic curiosity (Atlas Ch.7).
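The induction-head mechanism from Elhage et al. (2021) has a simple behavioral signature that is easy to test for: on a sequence whose second half repeats its first half, an induction head attends from each repeated token back to the token just after its first occurrence. Below is a minimal sketch of that diagnostic, assuming you already have per-head attention patterns; the toy pattern is random, purely for shape-checking.

```python
import torch

def induction_score(attn: torch.Tensor, rep_len: int) -> torch.Tensor:
    """attn: (n_heads, seq, seq) attention pattern on a sequence of length
    2 * rep_len whose second half repeats the first. Returns a score in
    [0, 1] per head; near 1.0 marks an induction-head candidate."""
    _, seq, _ = attn.shape
    assert seq == 2 * rep_len
    q = torch.arange(rep_len, seq)   # query positions in the second copy
    k = q - rep_len + 1              # induction target: first occurrence + 1
    return attn[:, q, k].mean(dim=-1)

# Toy causal attention pattern (12 heads, sequence length 16).
scores = torch.randn(12, 16, 16)
mask = torch.triu(torch.ones(16, 16, dtype=torch.bool), diagonal=1)
attn = torch.softmax(scores.masked_fill(mask, float("-inf")), dim=-1)
print(induction_score(attn, rep_len=8))
```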
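The sparse-autoencoder recipe from Templeton et al. (2024) is likewise compact at its core: reconstruct model activations through an overcomplete ReLU bottleneck with an L1 sparsity penalty, so individual latents tend toward single human-readable features. A minimal sketch follows; the dimensions, penalty coefficient, and random stand-in activations are illustrative assumptions, not the paper's values.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 768, d_features: int = 16_384):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, acts: torch.Tensor):
        feats = torch.relu(self.encoder(acts))  # sparse, non-negative codes
        recon = self.decoder(feats)
        return recon, feats

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
l1_coeff = 1e-3                                 # sparsity/fidelity trade-off

acts = torch.randn(4096, 768)                   # stand-in for real activations
for batch in acts.split(256):
    recon, feats = sae(batch)
    loss = (recon - batch).pow(2).mean() + l1_coeff * feats.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```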
Open questions
- Does interpretability scale with model size? Mechanistic interpretability has worked well on small transformers and on Claude 3 Sonnet. Whether it continues to work at GPT-5+ or Claude 5 scale, where features may be more polysemantic and circuits more entangled, is the central empirical bet (Templeton et al. 2024 §6).
- Can interpretability detect a strategic adversary? Linear probes and SAE-based detectors report 95-99% accuracy on stylized deception tasks (the baseline is sketched after this list), but performance against a model that optimizes against the probe is largely unestablished (Atlas Ch.7).
- What is the right unit of interpretation? Features (Olah's circuits), attention heads (Elhage), SAE latents (Templeton), attribution graphs (Lindsey): the methodology has shifted multiple times in five years. Whether the field converges on a stable framework, or the unit of analysis keeps shifting, is open.
- Does training pressure on interpreted features destroy interpretability? Training a model directly against a chain-of-thought monitor causes it to obfuscate its reasoning; whether the same holds for SAE features is open. If it does, interpretability becomes a read-only tool ("the most forbidden technique").
- Faithfulness gap. Even when interpretability tools identify a feature or circuit, whether that feature causally drives behavior, rather than merely correlating with it, is non-trivial. Interventional evidence is preferred but expensive (Atlas Ch.7); a minimal ablation sketch follows this list.
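For concreteness, the linear-probe baseline behind the 95-99% figures is roughly this: fit a logistic-regression classifier on activations labeled honest vs. deceptive. Everything below is synthetic stand-in data (the mean shift manufactures the signal); real studies use activations extracted from contrastive prompt pairs.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
d_model = 768
honest = rng.normal(0.0, 1.0, size=(500, d_model))
deceptive = rng.normal(0.2, 1.0, size=(500, d_model))  # shifted mean = fake signal

X = np.vstack([honest, deceptive])
y = np.array([0] * 500 + [1] * 500)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("probe accuracy:", probe.score(X_te, y_te))
# High accuracy here says nothing about a model optimizing its activations
# to fool exactly this probe, which is the open question above.
```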
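And the standard interventional check for faithfulness: project a candidate feature direction out of a layer's output and measure how much behavior actually changes. A minimal sketch using a PyTorch forward hook; the toy linear layer stands in for a residual block, and the random direction is purely illustrative. A feature that is only a correlate will leave a small causal footprint under this ablation.

```python
import torch
import torch.nn as nn

def make_ablation_hook(direction: torch.Tensor):
    d = direction / direction.norm()
    def hook(module, inputs, output):
        # Returning a value from a forward hook replaces the module's
        # output: here, with the feature direction projected out.
        return output - (output @ d).unsqueeze(-1) * d
    return hook

layer = nn.Linear(16, 16)                # stand-in for a residual block
x = torch.randn(4, 16)
clean = layer(x)

handle = layer.register_forward_hook(make_ablation_hook(torch.randn(16)))
ablated = layer(x)
handle.remove()

print("causal footprint:", (ablated - clean).abs().max().item())
```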
Related agendas
- reverse-engineering — the SR2025 agenda for circuits-style reverse-engineering of trained models.
- sparse-coding — SAE-based feature finding.
- representation-structure-and-geometry — the geometric / representation-engineering side of interpretability.
- causal-abstractions — causal-abstraction-based interpretability.
- model-diffing — comparing internals across model versions.
- lie-and-deception-detectors — applied interpretability for deception detection.
- chain-of-thought-monitoring — adjacent: monitor the model’s stated reasoning rather than internal activations.
- learning-dynamics-and-developmental-interpretability — when and how features form during training.
- heuristic-explanations — formal-explanation-based variant.
- other-interpretability — catch-all for adjacent agendas.
Related concepts
- mechanistic-interpretability — the narrower technical sub-field focused on circuits and features.
- ai-alignment — the parent problem interpretability serves.
- deceptive-alignment — the failure mode that makes interpretability load-bearing.
- scheming — strategic deception that interpretability is the proposed counter to.
- scalable-oversight — complementary: oversight provides the institutional framework, interpretability provides the verification tools.
- capability-evaluations — interpretability could detect sandbagged capability when behavior alone can’t.
- ai-control — interpretability could weaken the assumptions control protocols rely on.
- model-organisms-of-misalignment — provides the ground truth for evaluating interpretability methods.
Related pages
- mechanistic-interpretability
- ai-alignment
- ai-safety
- deceptive-alignment
- scheming
- scalable-oversight
- capability-evaluations
- ai-control
- model-organisms-of-misalignment
- instrumental-convergence
- ai-governance
- differential-development
- reverse-engineering
- sparse-coding
- representation-structure-and-geometry
- causal-abstractions
- model-diffing
- lie-and-deception-detectors
- chain-of-thought-monitoring
- learning-dynamics-and-developmental-interpretability
- heuristic-explanations
- other-interpretability
- anthropic
- deepmind
- openai
- charbel-raphael-segerie
- atlas-ch7-goal-misgeneralization-05-detection
- ai-safety-atlas-textbook
Sources cited
- AI Safety Atlas Ch.7 — Detection — referenced as [[atlas-ch7-goal-misgeneralization-05-detection]]