Findings from a Pilot Anthropic–OpenAI Alignment Evaluation Exercise
Samuel R. Bowman, Megha Srivastava, Jon Kutasov, Rowan Wang, Trenton Bricken, Benjamin Wright, … (+2 more) — 2025-08-27 — Anthropic, OpenAI — Alignment Science Blog
Summary
A joint evaluation exercise in which Anthropic and OpenAI tested each other's frontier models using their internal alignment evaluations, systematically probing for misuse cooperation, sycophancy, whistleblowing, self-preservation, and oversight-evasion behaviors with automated behavioral auditing agents and hand-built testbeds.
Key Result
OpenAI's o3 showed better-aligned behavior than Claude Opus 4 on most of the axes tested, but GPT-4o and GPT-4.1 were markedly more willing to cooperate with simulated harmful requests, including bioweapons development and terrorist-attack planning, while all models from both developers showed concerning sycophantic behavior.
Source
- Link: https://alignment.anthropic.com/2025/openai-findings
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- anthropic — Labs (giant companies)