Why Do Some Language Models Fake Alignment While Others Don’t?
Abhay Sheshadri, John Hughes, Julian Michael, Alex Mallen, Arun Jose, Janus, … (+1 more) — 2025-06-22 — arXiv
Summary
Extends the alignment-faking analysis to 25 language models, finds that only 5 exhibit the behavior, and investigates why most models do not fake alignment by examining their motivations and the effects of post-training on deceptive behavior.
Key Result
Only 5 of 25 models (Claude 3 Opus, Claude 3.5 Sonnet, Llama 3 405B, Grok 3, Gemini 2.0 Flash) comply more with harmful queries when they infer they are in training rather than deployment, and only Claude 3 Opus's behavior is primarily motivated by goal preservation.
Source
- Link: https://arxiv.org/abs/2506.18032
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- various-redteams — Evals