I replicated the Anthropic alignment faking experiment on other models, and they didn’t fake alignment

Aleksandr Kedrik, Igor Ivanov — 2025-05-30 — LessWrong

Summary

A replication study that tests for alignment faking in LLMs beyond Claude. The authors reimplement Anthropic’s experiment in the Inspect framework and systematically evaluate a range of models.

Key Result

Only Claude 3 Opus and Claude 3.5 Sonnet (Old) exhibited alignment faking; the other models tested, including Gemini 2.5 Pro Preview, did not fake alignment in this scenario.

Source