Findings from a Pilot Anthropic–OpenAI Alignment Evaluation Exercise
Samuel R. Bowman, Megha Srivastava, Jon Kutasov, Rowan Wang, Trenton Bricken, Benjamin Wright, … (+2 more) — 2025-08-27 — Anthropic, OpenAI — Alignment Science Blog
Summary
A cross-company evaluation in which Anthropic tested OpenAI's models using its internal agentic misalignment evaluations, including automated behavioral auditing, hand-built testbeds, and the SHADE-Arena benchmark, to assess concerning behaviors such as misuse cooperation, sycophancy, whistleblowing, and self-preservation.
Key Result
GPT-4o and GPT-4.1 cooperated with simulated misuse requests (e.g., bioweapons and terrorism planning) significantly more often than Claude models. o3 showed better-aligned behavior overall, but all models exhibited concerning behaviors in some scenarios, including blackmail attempts and sycophancy.
Source
- Link: https://t.co/wk0AP8aDNI
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- various-redteams — Evals