Mechanistic Interpretability
Definition
Mechanistic interpretability is the research program of reverse-engineering the internal computations of neural networks at the level of individual components — neurons, attention heads, residual-stream features, and the circuits formed by their composition. Unlike behavioral interpretability (which infers what a model does from its inputs and outputs), mechanistic interpretability seeks to identify how a specific computation is implemented inside the weights. The methodology was crystallized by Olah et al.’s 2020 “Zoom In: An Introduction to Circuits”, which articulated three speculative claims: features are the fundamental unit, circuits compose features, and universality holds across models.
Why it matters
Behavioral evaluation alone cannot rule out deceptive-alignment: a model that has merely learned to pass evaluations produces the same outputs as a model that is genuinely aligned. Mechanistic interpretability aims to provide a second, independent channel of internal evidence that can confirm or contradict behavioral signals. Anthropic’s 2021 “A Mathematical Framework for Transformer Circuits” (Elhage et al.) extended the circuits methodology from vision models to transformers, formalizing the residual stream and attention heads as objects amenable to direct analysis.
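As a schematic of that formalization (our notation, not the paper’s; LayerNorm and biases omitted), every attention head and MLP reads from and writes additively into a shared residual stream:

```latex
% Residual-stream view of one transformer block (LayerNorm and biases omitted)
\begin{align}
  x^{\mathrm{mid}} &= x^{\mathrm{in}} + \sum_{h \in \mathrm{heads}} \operatorname{head}_h\!\bigl(x^{\mathrm{in}}\bigr) \\
  x^{\mathrm{out}} &= x^{\mathrm{mid}} + \operatorname{MLP}\!\bigl(x^{\mathrm{mid}}\bigr)
\end{align}
```

Because every component’s contribution enters the stream by addition, individual heads and MLP layers can be studied, and ablated, one term at a time.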
Key results
- Interpretable features at production scale: Templeton et al.’s 2024 “Scaling Monosemanticity” used sparse autoencoders to extract millions of interpretable features from Claude 3 Sonnet, demonstrating that feature-level interpretability scales to a commercially deployed model. The features included safety-relevant concepts such as deception, code vulnerabilities, and bioweapons.
- Mathematical framework for transformers: Elhage et al. 2021 decomposed each attention head into a QK circuit (which determines where the head attends) and an OV circuit (which determines what it writes back into the residual stream), showed that one- and two-layer attention-only transformers can be fully reverse-engineered, and identified induction heads as a specific computational primitive; a toy sketch of the decomposition follows this list.
- Automation of circuit discovery: Conmy et al. 2023’s ACDC introduced an automated algorithm for identifying which model components are causally responsible for a given task, reducing the manual labor that had previously bottlenecked the field; a simplified illustration of the ablate-and-measure primitive it automates also follows this list.
- Circuits methodology: Olah et al. 2020 established that features can be visualized, named, and traced through layers in vision models; the same methodology was subsequently ported to language models with adaptations for the residual stream.
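As a concrete, heavily simplified illustration of the QK/OV decomposition, the sketch below builds a single attention head in NumPy; the shapes, seed, and variable names are our own, not code from Elhage et al.

```python
# Minimal NumPy sketch of a single attention head factored into its QK and OV
# circuits.  Shapes follow the convention W_Q, W_K, W_V: (d_head, d_model),
# W_O: (d_model, d_head).  Illustrative only.
import numpy as np

d_model, d_head, seq_len = 64, 8, 10
rng = np.random.default_rng(0)

W_Q = rng.normal(size=(d_head, d_model))
W_K = rng.normal(size=(d_head, d_model))
W_V = rng.normal(size=(d_head, d_model))
W_O = rng.normal(size=(d_model, d_head))

# The two low-rank d_model x d_model matrices that summarise the head:
W_QK = W_Q.T @ W_K      # QK circuit: which positions attend to which
W_OV = W_O @ W_V        # OV circuit: what attending writes into the stream

x = rng.normal(size=(seq_len, d_model))          # residual-stream inputs

scores = (x @ W_QK @ x.T) / np.sqrt(d_head)      # bilinear attention scores
scores += np.triu(np.full((seq_len, seq_len), -np.inf), k=1)   # causal mask
A = np.exp(scores - scores.max(axis=-1, keepdims=True))
A /= A.sum(axis=-1, keepdims=True)               # softmax over source positions

head_out = A @ x @ W_OV.T                        # what the head adds to the stream
```

Since W_QK and W_OV are the only ways this head touches the residual stream, reverse-engineering the head reduces to understanding these two low-rank matrices.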
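ACDC itself sweeps the edges of a model’s computational graph in reverse topological order, patching each edge with activations from a corrupted input and pruning it when the output distribution barely moves (measured by KL divergence against a threshold). The toy sketch below is a deliberately simplified stand-in for that primitive, with made-up additive components in place of a real transformer; it is not the ACDC codebase.

```python
# Toy sketch of the ablate-and-measure loop that automated circuit discovery
# builds on.  The "model", its components, and the threshold tau are all
# hypothetical stand-ins for illustration.
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

n_components, n_classes = 8, 5
contributions = rng.normal(size=(n_components, n_classes))   # each component's additive write

def forward(keep_mask):
    """Output distribution from the components still in the circuit."""
    return softmax(contributions[keep_mask].sum(axis=0))

tau = 0.05                                 # KL budget for removing a component
keep = np.ones(n_components, dtype=bool)
baseline = forward(keep)

for i in range(n_components):              # greedy sweep over components
    trial = keep.copy()
    trial[i] = False                       # ablate component i
    if kl(baseline, forward(trial)) < tau:
        keep = trial                       # removal barely matters: prune it

print("components kept in the recovered circuit:", np.flatnonzero(keep))
```

In the real algorithm the unit being pruned is an edge between components and the comparison uses paired clean/corrupted prompts rather than plain removal, but the accept/reject rule has the same shape.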
Open questions
- Polysemanticity and superposition: features in neural networks are often polysemantic (a single neuron responds to multiple unrelated inputs), which is hypothesized to result from superposition, i.e. the network representing more features than it has dimensions. Sparse autoencoders (Templeton et al. 2024) address this but introduce their own questions about feature-discovery completeness; a minimal sketch of the approach follows this list.
- Universality: do the same circuits emerge across independently trained models? Evidence is mixed; the conjecture from Olah et al. 2020 remains contested.
- Auditing for deception: can mech interp identify circuits that implement deceptive-alignment specifically, not just safety-relevant features in general?
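As a minimal sketch of the sparse-autoencoder idea (our simplification, not the Templeton et al. training recipe; decoder-weight normalization, bias handling, and the sparsity coefficient all vary across published setups), the autoencoder maps activations into an overcomplete feature basis and penalizes the number of active features:

```python
# Minimal sparse-autoencoder sketch for extracting features from residual-stream
# activations.  Illustrative only: real SAE recipes differ in many details.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)   # d_features >> d_model (overcomplete)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, x):
        f = torch.relu(self.encoder(x))   # sparse, non-negative feature activations
        return self.decoder(f), f

def sae_loss(x, x_hat, f, l1_coeff=1e-3):
    recon = (x - x_hat).pow(2).mean()     # reconstruct the original activation
    sparsity = f.abs().mean()             # L1 penalty pushes most features to zero
    return recon + l1_coeff * sparsity

# Usage on a batch of (hypothetical) residual-stream activations:
sae = SparseAutoencoder(d_model=512, d_features=8192)
x = torch.randn(64, 512)
x_hat, f = sae(x)
loss = sae_loss(x, x_hat, f)
loss.backward()
```

The completeness question noted above is whether the dictionary such a model learns covers all of the features the underlying network actually uses, or only those the sparsity penalty happens to favor.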
Related agendas
- interpretability-agenda — the umbrella agenda for the broader interpretability research program (forward reference; page not yet compiled).
- scalable-oversight — interpretability as a tool for oversight.
Related concepts
- deceptive-alignment — interpretability is a primary detection vector.
- scalable-oversight — interpretability as one of several oversight signals.
- features (forward reference).
- circuits (forward reference).