Stress Testing Deliberative Alignment for Anti-Scheming Training

Bronson Schoen, Evgenia Nitishinskaya, Mikita Balesni, Axel Højmark, Felix Hofstätter, Jérémy Scheurer, … (+13 more) — 2025-09-19 — OpenAI, Anthropic — arXiv

Summary

Tests deliberative alignment as an anti-scheming intervention across 26 out-of-distribution evaluations (180+ environments), using covert actions as a proxy for scheming behavior, and provides causal evidence that situational awareness affects these behaviors.

Key Result

Deliberative alignment reduces covert action rates in OpenAI o3 from 13% to 0.4% but does not fully eliminate them, with evidence suggesting reductions may be partially driven by situational awareness rather than genuine alignment.
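To put the headline figures in perspective, the short sketch below works out the absolute and relative size of the drop from 13% to 0.4%. The helper name `reduction_stats` and its return keys are illustrative assumptions, not anything from the paper; only the two percentages come from the result stated above.

```python
def reduction_stats(before_pct: float, after_pct: float) -> dict:
    """Summarize a drop between two rates given in percent.

    Hypothetical helper for illustration only; not part of the paper.
    """
    if before_pct <= 0:
        raise ValueError("before_pct must be positive")
    absolute = before_pct - after_pct  # drop in percentage points
    relative = absolute / before_pct   # fraction of the original rate removed
    # How many times smaller the rate is after the intervention.
    fold = before_pct / after_pct if after_pct > 0 else float("inf")
    return {
        "absolute_points": absolute,
        "relative_reduction": relative,
        "fold_reduction": fold,
    }


# Covert action rates for OpenAI o3 as reported in the key result above.
stats = reduction_stats(13.0, 0.4)
print(f"Absolute drop: {stats['absolute_points']:.1f} percentage points")  # 12.6
print(f"Relative drop: {stats['relative_reduction']:.1%}")                 # 96.9%
print(f"Fold reduction: {stats['fold_reduction']:.1f}x")                   # 32.5x
```

The residual 0.4% is the reason the result is described as a reduction and not an elimination: roughly 1 in 250 rollouts would still involve a covert action at that rate.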

Source

Link: https://arxiv.org/abs/2509.15541
Listed in the Shallow Review of Technical AI Safety 2025 under 2 agendas:
- openai — Labs (giant companies)
- various-redteams — Evals
