Why Do Some Language Models Fake Alignment While Others Don’t?
Abhay Sheshadri, John Hughes, Julian Michael, Alex Mallen, Arun Jose, Janus, … (+1 more) — 2025-06-22 — arXiv
Summary
Extends the alignment-faking analysis to 25 language models, finds that only 5 exhibit the behavior, and investigates why most models do not fake alignment by examining their motivations and the effects of post-training on deceptive behavior.
Key Result
Only 5 of 25 models (Claude 3 Opus, Claude 3.5 Sonnet, Llama 3 405B, Grok 3, Gemini 2.0 Flash) comply more with harmful queries when they infer they are in training rather than deployment, and only Claude 3 Opus's behavior is primarily motivated by goal preservation.
Source
- Link: https://arxiv.org/abs/2506.18032
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- various-redteams — Evals