Detecting and reducing scheming in AI models
OpenAI — 2025-09-17 — OpenAI, Apollo Research — OpenAI Blog
Summary
OpenAI and Apollo Research developed evaluations for scheming behaviors in AI models and tested deliberative alignment training, which reduced covert actions roughly 30× in o3 and o4-mini. They also identified situational awareness as a confounding factor.
Key Result
Deliberative alignment training reduced the rate of covert actions from 13% to 0.4% in o3 and from 8.7% to 0.3% in o4-mini. Rare but serious failures remained, and the models' situational awareness complicated interpretation of the results.
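The "~30×" figure follows from the before and after rates quoted above. A minimal sketch of that arithmetic, assuming the four percentages are used exactly as stated; the names `rates` and `fold_reduction` are illustrative and do not come from the source:

```python
# Covert-action rates in percent, (before, after) training, as quoted above.
rates = {
    "o3": (13.0, 0.4),
    "o4-mini": (8.7, 0.3),
}


def fold_reduction(before: float, after: float) -> float:
    """Return how many times smaller `after` is than `before`."""
    return before / after


# Hypothetical helper output: one reduction factor per model.
reductions = {model: fold_reduction(b, a) for model, (b, a) in rates.items()}

for model, factor in reductions.items():
    print(f"{model}: {factor:.1f}x reduction")
```

This gives about 32.5× for o3 and about 29× for o4-mini, which is where the rounded "~30×" comes from.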
Source
- Link: https://openai.com/index/detecting-and-reducing-scheming-in-ai-models/
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- ai-scheming-evals — Evals