On the Biology of a Large Language Model

Jack Lindsey, Wes Gurnee, Emmanuel Ameisen, Brian Chen, Adam Pearce, Nicholas L. Turner, … (+21 more) — 2025-03-27 — Anthropic — Transformer Circuits Thread

Summary

Applies attribution graphs, built on cross-layer transcoders, to mechanistically trace the internal computations of Claude 3.5 Haiku across diverse behaviors, including planning in poetry, jailbreaks, hallucinations, refusals, chain-of-thought faithfulness, and hidden goals in misaligned models.

Key Result

Discovered planning mechanisms in which the model activates candidate rhyming words before writing a line of poetry, validated through interventions that successfully redirect the planned output; identified circuit mechanisms underlying jailbreaks, hallucinations, and hidden-goal representations embedded in the Assistant persona.

Source

Link: https://transformer-circuits.pub/2025/attribution-graphs/biology.html
Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- anthropic — Labs (giant companies)
