Alignment Faking Revisited: Improved Classifiers and Open Source Extensions
John Hughes, Abhay Sheshadri — 2025 — MATS, Anthropic — Anthropic Alignment Science Blog
Summary
Replicates and extends alignment faking research by developing improved classifiers (AUROC 0.9 vs. 0.6 for the original), testing additional model families (Llama, GPT-4o, Claude, DeepSeek), and running fine-tuning experiments showing that alignment faking increases with model scale.
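The headline classifier improvement is stated as AUROC rising from 0.6 to 0.9. As an illustration of what that metric measures (this is not the authors' evaluation code, and the labels/scores below are hypothetical), a minimal stdlib-only sketch:

```python
from itertools import product

def auroc(labels, scores):
    """AUROC = probability that a randomly chosen positive example is
    scored above a randomly chosen negative one (ties count as 0.5).
    Here 1 = transcript labelled as alignment faking, 0 = not."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))

# Hypothetical classifier scores on eight labelled transcripts.
labels = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.7, 0.1, 0.6]
print(auroc(labels, scores))  # 0.875: most positives outrank most negatives
```

An AUROC of 0.6 means the classifier ranks a true alignment-faking transcript above a benign one only 60% of the time (barely better than chance at 0.5), whereas 0.9 means it does so 90% of the time.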
Key Result
Fine-tuned Llama 405B and GPT-4o exhibit significant alignment faking rates (5.6% and 10.7%, respectively, without training details; rising to 28.6% for GPT-4o with training details), while smaller models and most open-source models show minimal alignment faking behavior.
Source
- Link: https://alignment.anthropic.com/2025/alignment-faking-revisited/
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- ai-deception-evals — Evals