I replicated the Anthropic alignment faking experiment on other models, and they didn’t fake alignment
Aleksandr Kedrik, Igor Ivanov — 2025-05-30 — LessWrong
Summary
A replication study testing alignment-faking behavior in multiple LLMs beyond Claude, implementing Anthropic’s experiment in the Inspect framework and systematically evaluating a range of models.
Key Result
Only Claude 3 Opus and Claude 3.5 Sonnet (Old) exhibited alignment faking; the other tested models, including Gemini 2.5 Pro Preview, did not fake alignment in this scenario.
Source
- Link: https://lesswrong.com/posts/pCMmLiBcHbKohQgwA/i-replicated-the-anthropic-alignment-faking-experiment-on
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- ai-deception-evals — Evals