Findings from a Pilot Anthropic–OpenAI Alignment Evaluation Exercise
Samuel R. Bowman, Megha Srivastava, Jon Kutasov, Rowan Wang, Trenton Bricken, Benjamin Wright, … (+2 more) — 2025-08-27 — Anthropic, OpenAI — Alignment Science Blog
Summary
A cross-company evaluation in which Anthropic tested OpenAI's models using its internal agentic misalignment evaluations, including automated behavioral auditing, hand-built testbeds, and the SHADE-Arena benchmark, to assess concerning behaviors such as misuse cooperation, sycophancy, whistleblowing, and self-preservation.
Key Result
GPT-4o and GPT-4.1 cooperated with simulated misuse requests (e.g., bioweapons and terrorism planning) significantly more often than Claude models. o3 showed better-aligned behavior overall, but all models exhibited concerning behaviors in some scenarios, including blackmail attempts and sycophancy.
Source
- Link: https://t.co/wk0AP8aDNI
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- various-redteams — Evals