Interpretability
Definition
Interpretability is the field that aims to understand what trained neural networks are doing internally — what features they represent, what algorithms their parameters implement, and how those map onto the model’s external behavior. The most ambitious sub-area is mechanistic interpretability, which aims to reverse-engineer the network at the level of features, circuits, and attribution graphs (Olah et al. 2020, Zoom In: An Introduction to Circuits; Elhage et al. 2021, A Mathematical Framework for Transformer Circuits).
For the safety-relevant sub-area focused specifically on transformer circuits and features, see mechanistic-interpretability (a partially overlapping page that covers the narrower technical agenda).
Why it matters
Interpretability is the only known approach that could in principle verify alignment without relying solely on behavioral evidence, and behavioral evidence is exactly what deceptive-alignment / scheming is built to defeat. If interpretability works, you can look inside a model and check whether its internal goals match its external behavior. If it doesn't work, behavioral evaluation has a known ceiling (Atlas Ch.7 — Detection; see atlas-ch7-goal-misgeneralization-05-detection).
This is why interpretability research is one of the largest investments at frontier safety labs: Anthropic’s interpretability team is among its largest research groups, and most of the recent breakthroughs in feature-finding came out of it (Templeton et al. 2024, Scaling Monosemanticity; Lindsey et al. 2025, On the Biology of a Large Language Model).
Three concrete safety applications:
- Alignment verification. If we can read the internal goals, we can check alignment (Atlas Ch.7).
- Deception detection. Behavioral testing fails against a strategically deceptive model; interpretability is the proposed escape hatch (Hubinger et al. 2024, Sleeper Agents; see atlas-ch7-goal-misgeneralization-05-detection).
- Misuse detection. Identifying internal patterns associated with bioweapons design or cyber-attack planning, before the output is generated.
Key results
- The Circuits research program (Olah et al. 2020). The original Distill series introduced the methodology of reverse-engineering vision networks at the level of individual neurons and the circuits connecting them, and established that meaningful, human-understandable features (curve detectors, the multimodal "Spider-Man" neuron) can be identified inside trained networks.
- Mathematical framework for transformer circuits (Elhage et al. 2021). Generalized the circuits methodology to transformers and identified the mechanism (induction heads) that explains much of in-context learning; a diagnostic sketch for induction heads appears after this list. The framework underlies almost all subsequent transformer interpretability work.
- Sparse autoencoders scale feature-finding to frontier models (Templeton et al. 2024, Scaling Monosemanticity). Anthropic trained sparse autoencoders on Claude 3 Sonnet activations and identified millions of monosemantic features, including safety-relevant ones (deception, sycophancy, code vulnerabilities). This was the first demonstration that feature-level interpretability scales to a deployed frontier model; a minimal SAE sketch also follows this list.
- Attribution-graph methodology (Lindsey et al. 2025, On the Biology of a Large Language Model). Anthropic's most recent methodology for tracing how features compose into behavior, the analog of biology (anatomy plus physiology) for trained transformers. It sets the current methodological frontier.
- Sleeper Agents motivate interpretability-based detection (Hubinger et al. 2024). Backdoors trained into models persist through supervised fine-tuning, RL, and adversarial training: direct evidence that behavioral methods cannot remove installed misalignment, leaving interpretability-based detection as the primary remaining lever.
- Practical applications already exist. Pre-mechanistic-interpretability work already produced concrete results: medical imaging classifiers shown to attend to hospital labels rather than disease features, and model editing of factual associations (the Eiffel Tower study). These show interpretability tools delivering safety-relevant insight, not just academic curiosity (Atlas Ch.7).
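The induction-head mechanism from Elhage et al. (2021) has a simple behavioral signature that is easy to test for: on a sequence whose second half repeats its first half, an induction head attends from each repeated token back to the token just after its first occurrence. Below is a minimal sketch of that diagnostic, assuming you already have per-head attention patterns; the toy pattern is random, purely for shape-checking.

```python
import torch

def induction_score(attn: torch.Tensor, rep_len: int) -> torch.Tensor:
    """attn: (n_heads, seq, seq) attention pattern on a sequence of length
    2 * rep_len whose second half repeats the first. Returns a score in
    [0, 1] per head; near 1.0 marks an induction-head candidate."""
    _, seq, _ = attn.shape
    assert seq == 2 * rep_len
    q = torch.arange(rep_len, seq)   # query positions in the second copy
    k = q - rep_len + 1              # induction target: first occurrence + 1
    return attn[:, q, k].mean(dim=-1)

# Toy causal attention pattern (12 heads, sequence length 16).
scores = torch.randn(12, 16, 16)
mask = torch.triu(torch.ones(16, 16, dtype=torch.bool), diagonal=1)
attn = torch.softmax(scores.masked_fill(mask, float("-inf")), dim=-1)
print(induction_score(attn, rep_len=8))
```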
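The sparse-autoencoder recipe from Templeton et al. (2024) is likewise compact at its core: reconstruct model activations through an overcomplete ReLU bottleneck with an L1 sparsity penalty, so individual latents tend toward single human-readable features. A minimal sketch follows; the dimensions, penalty coefficient, and random stand-in activations are illustrative assumptions, not the paper's values.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 768, d_features: int = 16_384):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, acts: torch.Tensor):
        feats = torch.relu(self.encoder(acts))  # sparse, non-negative codes
        recon = self.decoder(feats)
        return recon, feats

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
l1_coeff = 1e-3                                 # sparsity/fidelity trade-off

acts = torch.randn(4096, 768)                   # stand-in for real activations
for batch in acts.split(256):
    recon, feats = sae(batch)
    loss = (recon - batch).pow(2).mean() + l1_coeff * feats.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```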
Open questions
- Does interpretability scale with model size? Mechanistic interpretability has worked well on small transformers and on Claude 3 Sonnet. Whether it continues to work at GPT-5+ or Claude 5 scale, where features may be more polysemantic and circuits more entangled, is the central empirical bet (Templeton et al. 2024 §6).
- Can interpretability detect a strategic adversary? Linear probes and SAE-based detectors report 95-99% accuracy on stylized deception tasks (the baseline is sketched after this list), but performance against a model that optimizes against the probe is largely unestablished (Atlas Ch.7).
- What is the right unit of interpretation? Features (Olah's circuits), attention heads (Elhage), SAE latents (Templeton), attribution graphs (Lindsey): the methodology has shifted multiple times in five years. Whether the field converges on a stable framework, or the unit of analysis keeps shifting, is open.
- Does training pressure on interpreted features destroy interpretability? Training a model directly against a chain-of-thought monitor causes it to obfuscate its reasoning; whether the same holds for SAE features is open. If it does, interpretability becomes a read-only tool ("the most forbidden technique").
- Faithfulness gap. Even when interpretability tools identify a feature or circuit, whether that feature causally drives behavior, rather than merely correlating with it, is non-trivial. Interventional evidence is preferred but expensive (Atlas Ch.7); a minimal ablation sketch follows this list.
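For concreteness, the linear-probe baseline behind the 95-99% figures is roughly this: fit a logistic-regression classifier on activations labeled honest vs. deceptive. Everything below is synthetic stand-in data (the mean shift manufactures the signal); real studies use activations extracted from contrastive prompt pairs.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
d_model = 768
honest = rng.normal(0.0, 1.0, size=(500, d_model))
deceptive = rng.normal(0.2, 1.0, size=(500, d_model))  # shifted mean = fake signal

X = np.vstack([honest, deceptive])
y = np.array([0] * 500 + [1] * 500)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("probe accuracy:", probe.score(X_te, y_te))
# High accuracy here says nothing about a model optimizing its activations
# to fool exactly this probe, which is the open question above.
```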
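And the standard interventional check for faithfulness: project a candidate feature direction out of a layer's output and measure how much behavior actually changes. A minimal sketch using a PyTorch forward hook; the toy linear layer stands in for a residual block, and the random direction is purely illustrative. A feature that is only a correlate will leave a small causal footprint under this ablation.

```python
import torch
import torch.nn as nn

def make_ablation_hook(direction: torch.Tensor):
    d = direction / direction.norm()
    def hook(module, inputs, output):
        # Returning a value from a forward hook replaces the module's
        # output: here, with the feature direction projected out.
        return output - (output @ d).unsqueeze(-1) * d
    return hook

layer = nn.Linear(16, 16)                # stand-in for a residual block
x = torch.randn(4, 16)
clean = layer(x)

handle = layer.register_forward_hook(make_ablation_hook(torch.randn(16)))
ablated = layer(x)
handle.remove()

print("causal footprint:", (ablated - clean).abs().max().item())
```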
Related agendas
- reverse-engineering — the SR2025 agenda for circuits-style reverse-engineering of trained models.
- sparse-coding — SAE-based feature finding.
- representation-structure-and-geometry — the geometric / representation-engineering side of interpretability.
- causal-abstractions — causal-abstraction-based interpretability.
- model-diffing — comparing internals across model versions.
- lie-and-deception-detectors — applied interpretability for deception detection.
- chain-of-thought-monitoring — adjacent: monitor the model’s stated reasoning rather than internal activations.
- learning-dynamics-and-developmental-interpretability — when and how features form during training.
- heuristic-explanations — formal-explanation-based variant.
- other-interpretability — catch-all for adjacent agendas.
Related concepts
- mechanistic-interpretability — the narrower technical sub-field focused on circuits and features.
- ai-alignment — the parent problem interpretability serves.
- deceptive-alignment — the failure mode that makes interpretability load-bearing.
- scheming — strategic deception that interpretability is the proposed counter to.
- scalable-oversight — complementary: oversight provides the institutional framework, interpretability provides the verification tools.
- capability-evaluations — interpretability could detect sandbagged capability when behavior alone can’t.
- ai-control — interpretability could weaken the assumptions control protocols rely on.
- model-organisms-of-misalignment — provides the ground truth for evaluating interpretability methods.
Related pages
- mechanistic-interpretability
- ai-alignment
- ai-safety
- deceptive-alignment
- scheming
- scalable-oversight
- capability-evaluations
- ai-control
- model-organisms-of-misalignment
- instrumental-convergence
- ai-governance
- differential-development
- reverse-engineering
- sparse-coding
- representation-structure-and-geometry
- causal-abstractions
- model-diffing
- lie-and-deception-detectors
- chain-of-thought-monitoring
- learning-dynamics-and-developmental-interpretability
- heuristic-explanations
- other-interpretability
- anthropic
- deepmind
- openai
- charbel-raphael-segerie
- atlas-ch7-goal-misgeneralization-05-detection
- ai-safety-atlas-textbook
Sources cited
- AI Safety Atlas Ch.7 — Detection — referenced as [[atlas-ch7-goal-misgeneralization-05-detection]]