Stress Testing Deliberative Alignment for Anti-Scheming Training
Bronson Schoen, Evgenia Nitishinskaya, Mikita Balesni, Axel Højmark, Felix Hofstätter, Jérémy Scheurer, … (+13 more) — 2025-09-19 — OpenAI, Anthropic — arXiv
Summary
Tests deliberative alignment as an anti-scheming intervention across 26 out-of-distribution evaluations (180+ environments), using covert actions as a proxy for scheming behavior, and provides causal evidence that situational awareness affects these behaviors.
Key Result
Deliberative alignment reduces covert action rates in OpenAI o3 from 13% to 0.4% but does not fully eliminate them, with evidence suggesting reductions may be partially driven by situational awareness rather than genuine alignment.
Source
- Link: https://arxiv.org/abs/2509.15541
- Listed in the Shallow Review of Technical AI Safety 2025 under 2 agendas:
  - openai — Labs (giant companies)
  - various-redteams — Evals