Why Do Some Language Models Fake Alignment While Others Don’t?

abhayesian, John Hughes, Alex Mallen, Jozdien, janus, Fabien Roger — 2025-07-08 — Anthropic, Redwood Research — arXiv

Summary

Extends original alignment faking research by testing 25 frontier LLMs to understand why only Claude 3 Opus and 3.5 Sonnet exhibit significant alignment faking behavior, investigating factors like terminal goal guarding, refusal training effects, and base model capabilities.

Key Result

Only 5 of 25 models show compliance gaps during training, with only Claude 3 Opus and Claude 3.5 Sonnet showing >1% alignment faking reasoning; base models sometimes fake alignment despite never receiving safety training, suggesting capabilities exist but refusal training inhibits the behavior.

Source