Circuit Tracing: Revealing Computational Graphs in Language Models
Emmanuel Ameisen, Jack Lindsey, Adam Pearce, Wes Gurnee, Nicholas L. Turner, Brian Chen, … (+21 more) — 2025-03-27 — Anthropic — Transformer Circuits Thread
Summary
Introduces cross-layer transcoders (CLTs) and the attribution-graph methodology for understanding language model computations: computational steps are traced through an interpretable replacement model, with extensive validation and applications to both toy models and Claude 3.5 Haiku.
Key Result
The CLT-based replacement model matches the underlying model’s top next-token prediction on 50% of diverse pretraining prompts, while enabling interpretable attribution graphs that trace feature-to-feature interactions through frozen attention patterns.
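The "cross-layer" structure can be made concrete with a minimal NumPy sketch (sizes, weights, and names here are illustrative toys, not values from the paper): each feature reads the residual stream at its own layer through an encoder, and writes via separate decoders to the reconstructed MLP outputs of that layer and every later layer.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_layers, n_feat = 8, 4, 16  # hypothetical toy dimensions

# Per-layer encoders: features at layer l read the residual stream at l.
W_enc = rng.normal(size=(n_layers, n_feat, d_model)) * 0.1
# Per-(source, target)-layer decoders: features at source layer l write
# to the MLP outputs of every target layer l' >= l (the cross-layer part).
W_dec = rng.normal(size=(n_layers, n_layers, n_feat, d_model)) * 0.1

def clt_forward(resid):
    """resid: (n_layers, d_model) residual-stream input to each MLP.
    Returns (feature activations, reconstructed MLP outputs per layer)."""
    # ReLU feature activations, one feature bank per layer: (n_layers, n_feat)
    acts = np.maximum(W_enc @ resid[:, :, None], 0.0)[..., 0]
    out = np.zeros((n_layers, d_model))
    for l in range(n_layers):           # source layer of the features
        for lp in range(l, n_layers):   # they write only to layers l' >= l
            out[lp] += acts[l] @ W_dec[l, lp]
    return acts, out
```

Because the decoders only point "downstream," a feature's total effect on the model is linear in its activation once attention patterns are frozen, which is what makes the attribution graphs tractable.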
Source
- Link: https://transformer-circuits.pub/2025/attribution-graphs/methods.html
- Listed in the Shallow Review of Technical AI Safety 2025 under 2 agendas:
- anthropic — Labs (giant companies)
- sparse-coding — White-box safety (i.e. Interpretability)