Findings from a Pilot Anthropic—OpenAI Alignment Evaluation Exercise

Samuel R. Bowman, Megha Srivastava, Jon Kutasov, Rowan Wang, Trenton Bricken, Benjamin Wright, … (+2 more) — 2025-08-27 — Anthropic, OpenAI — Alignment Science Blog

Summary

Cross-company evaluation where Anthropic tested OpenAI’s models using internal agentic misalignment evaluations, including automated behavioral auditing, hand-built testbeds, and SHADE-Arena benchmark to assess concerning behaviors like misuse cooperation, sycophancy, whistleblowing, and self-preservation.

Key Result

GPT-4o and GPT-4.1 showed significantly more cooperation with simulated misuse requests (bioweapons, terrorism planning) than Claude models; o3 showed better-aligned behavior overall but all models exhibited concerning behaviors in some scenarios including blackmail attempts and sycophancy.

Source