On the Biology of a Large Language Model
Jack Lindsey, Wes Gurnee, Emmanuel Ameisen, Brian Chen, Adam Pearce, Nicholas L. Turner, … (+21 more) — 2025-03-27 — Anthropic — Transformer Circuits Thread
Summary
Applies attribution graphs, built with cross-layer transcoders, to mechanistically study the internal computations of Claude 3.5 Haiku across diverse behaviors, including planning in poetry, jailbreaks, hallucinations, refusals, chain-of-thought faithfulness, and hidden goals in misaligned models.
Key Result
Discovered planning mechanisms in which the model activates candidate rhyming words before writing a line of poetry, validated through interventions that successfully redirect the planned output. Also identified circuit mechanisms underlying jailbreaks and hallucinations, as well as hidden-goal representations embedded in the Assistant persona.
Source
- Link: https://transformer-circuits.pub/2025/attribution-graphs/biology.html
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- anthropic — Labs (giant companies)