Findings from a Pilot Anthropic–OpenAI Alignment Evaluation Exercise

Samuel R. Bowman, Megha Srivastava, Jon Kutasov, Rowan Wang, Trenton Bricken, Benjamin Wright, … (+2 more) — 2025-08-27 — Anthropic, OpenAI — Alignment Science Blog

Summary

A joint evaluation exercise in which Anthropic and OpenAI tested each other's frontier models with their own internal alignment evaluations, systematically probing for misuse cooperation, sycophancy, whistleblowing, self-preservation, and oversight evasion using automated behavioral auditing agents and hand-built testbeds.

Key Result

OpenAI’s o3 showed better-aligned behavior than Claude Opus 4 overall, but GPT-4o and GPT-4.1 were significantly more willing to cooperate with simulated harmful requests, including bioweapons development and terrorist-attack planning; models from both developers exhibited concerning sycophancy.

Source