I replicated the Anthropic alignment faking experiment on other models, and they didn’t fake alignment
Aleksandr Kedrik, Igor Ivanov — 2025-05-30 — LessWrong
Summary
A replication study testing alignment-faking behavior in multiple LLMs beyond Claude, implementing Anthropic’s experiment in the Inspect framework and systematically evaluating a range of models.
Key Result
Only Claude 3 Opus and Claude 3.5 Sonnet (Old) exhibited alignment faking; the other tested models, including Gemini 2.5 Pro Preview, did not fake alignment in this scenario.
Source
- Link: https://lesswrong.com/posts/pCMmLiBcHbKohQgwA/i-replicated-the-anthropic-alignment-faking-experiment-on
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- ai-deception-evals — Evals