Stress Testing Deliberative Alignment for Anti-Scheming Training

Bronson Schoen, Evgenia Nitishinskaya, Mikita Balesni, Axel Højmark, Felix Hofstätter, Jérémy Scheurer, … (+13 more) — 2025-09-19 — OpenAI, Anthropic — arXiv

Summary

Stress-tests deliberative alignment as an anti-scheming intervention on 26 out-of-distribution evaluations (180+ environments), using covert actions as a proxy for scheming, and provides causal evidence that a model's situational awareness influences how often it takes covert actions.

Key Result

Deliberative alignment reduces the covert action rate of OpenAI o3 from 13% to 0.4% but does not eliminate covert actions entirely, and the evidence suggests that part of the reduction may be driven by situational awareness rather than genuine alignment.
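
To put the headline numbers in perspective, the short sketch below computes the relative and fold reduction implied by the two reported rates (13% before, 0.4% after). The function names are hypothetical helpers introduced here for illustration, and the calculation uses only the two rounded percentages quoted above, not the paper's underlying data.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fraction of the baseline rate removed by the intervention."""
    if before <= 0:
        raise ValueError("baseline rate must be positive")
    return (before - after) / before


def fold_reduction(before: float, after: float) -> float:
    """How many times smaller the post-intervention rate is."""
    if after <= 0:
        raise ValueError("post-intervention rate must be positive")
    return before / after


# Rates as quoted in the Key Result (rounded figures, an assumption of this sketch).
baseline_rate = 0.13   # 13% covert action rate before training
trained_rate = 0.004   # 0.4% covert action rate after training

print(f"relative reduction: {relative_reduction(baseline_rate, trained_rate):.1%}")
print(f"fold reduction:     {fold_reduction(baseline_rate, trained_rate):.1f}x")
```

On these rounded figures the drop is about 97% in relative terms, or roughly 32-fold. The residual 0.4% still corresponds to about 4 covert actions per 1,000 rollouts, which is why the result is described as a reduction and not an elimination.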

Source