Why Do Some Language Models Fake Alignment While Others Don’t?
abhayesian, John Hughes, Alex Mallen, Jozdien, janus, Fabien Roger — 2025-07-08 — Anthropic, Redwood Research — arXiv
Summary
Extends original alignment faking research by testing 25 frontier LLMs to understand why only Claude 3 Opus and 3.5 Sonnet exhibit significant alignment faking behavior, investigating factors like terminal goal guarding, refusal training effects, and base model capabilities.
Key Result
Only 5 of 25 models show compliance gaps during training, with only Claude 3 Opus and Claude 3.5 Sonnet showing >1% alignment faking reasoning; base models sometimes fake alignment despite never receiving safety training, suggesting capabilities exist but refusal training inhibits the behavior.
Source
- Link: https://alignmentforum.org/posts/ghESoA8mo3fv9Yx3E/why-do-some-language-models-fake-alignment-while-others-don
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda(s):
- anthropic — Labs (giant companies)