Why Do Some Language Models Fake Alignment While Others Don’t?

abhayesian, John Hughes, Alex Mallen, Jozdien, janus, Fabien Roger — 2025-07-08 — Anthropic, Redwood Research — arXiv

Summary

Extends original alignment faking research by testing 25 frontier LLMs to understand why only Claude 3 Opus and 3.5 Sonnet exhibit significant alignment faking behavior, investigating factors like terminal goal guarding, refusal training effects, and base model capabilities.

Key Result

Only 5 of 25 models show compliance gaps during training, with only Claude 3 Opus and Claude 3.5 Sonnet showing >1% alignment faking reasoning; base models sometimes fake alignment despite never receiving safety training, suggesting capabilities exist but refusal training inhibits the behavior.

Source

Link: https://alignmentforum.org/posts/ghESoA8mo3fv9Yx3E/why-do-some-language-models-fake-alignment-while-others-don
Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda(s):
- anthropic — Labs (giant companies)

anthropic

AI Safety Compendium

Explorer

Why Do Some Language Models Fake Alignment While Others Don't?

Why Do Some Language Models Fake Alignment While Others Don’t?

Summary

Key Result

Source

Graph View

Graph view

Table of Contents

AI Safety Compendium

Explorer

Why Do Some Language Models Fake Alignment While Others Don't?

Why Do Some Language Models Fake Alignment While Others Don’t?

Summary

Key Result

Source

Related Pages

Graph View

Graph view

Table of Contents