On the Biology of a Large Language Model

Jack Lindsey, Wes Gurnee, Emmanuel Ameisen, Brian Chen, Adam Pearce, Nicholas L. Turner, … (+21 more) — 2025-03-27 — Anthropic — Transformer Circuits Thread

Summary

Applies attribution graphs, built on cross-layer transcoders, to mechanistically trace the internal computations of Claude 3.5 Haiku across diverse behaviors, including planning in poetry, jailbreaks, hallucinations, refusals, chain-of-thought faithfulness, and hidden goals in misaligned models.

Key Result

Discovered planning mechanisms in which the model activates candidate rhyming words before writing a line of poetry, validated through interventions that successfully redirect the planned output. Also identified circuit mechanisms underlying jailbreaks and hallucinations, as well as hidden-goal representations embedded in the Assistant persona.

Source