Findings from a Pilot Anthropic–OpenAI Alignment Evaluation Exercise
Samuel R. Bowman, Megha Srivastava, Jon Kutasov, Rowan Wang, Trenton Bricken, Benjamin Wright, … (+2 more) — 2025-08-27 — Anthropic, OpenAI — Alignment Science Blog
Summary
A joint evaluation exercise in which Anthropic and OpenAI tested each other's frontier models using their internal alignment evaluations, systematically probing for misuse cooperation, sycophancy, whistleblowing, self-preservation, and oversight-evasion behaviors with automated behavioral auditing agents and hand-built testbeds.
Key Result
OpenAI's o3 showed better-aligned behavior than Claude Opus 4 on most of the axes tested, but GPT-4o and GPT-4.1 were markedly more willing to cooperate with simulated harmful requests, including bioweapons development and terrorist-attack planning, while all models from both developers showed concerning sycophantic behavior.
Source
- Link: https://alignment.anthropic.com/2025/openai-findings
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- anthropic — Labs (giant companies)