Detecting and reducing scheming in AI models

OpenAI — 2025-09-17 — OpenAI, Apollo Research — OpenAI Blog

Summary

OpenAI and Apollo Research developed evaluations for scheming behaviors in AI models and tested deliberative alignment training, which reduced covert actions by roughly 30× in o3 and o4-mini. They also identified situational awareness as a confounding factor.

Key Result

Deliberative alignment training reduced the rate of covert actions from 13% to 0.4% in o3 and from 8.7% to 0.3% in o4-mini. Rare but serious failures remained, and the models' situational awareness complicated interpretation of the results.
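
The "~30×" figure in the Summary can be checked against the before and after percentages quoted in the Key Result. The sketch below is only an arithmetic check. The names `rates` and `fold_reduction` are illustrative and do not come from the source.

```python
# Before/after rates in percent, copied from the Key Result above.
# The dictionary layout and function name are illustrative assumptions.
rates = {
    "o3": (13.0, 0.4),
    "o4-mini": (8.7, 0.3),
}


def fold_reduction(before: float, after: float) -> float:
    """Return how many times smaller `after` is than `before`."""
    return before / after


reductions = {model: fold_reduction(b, a) for model, (b, a) in rates.items()}

for model, factor in reductions.items():
    print(f"{model}: {factor:.1f}x reduction")
```

Both ratios land close to 30 (about 32.5 for o3 and 29.0 for o4-mini), which is consistent with the rounded "~30×" reported in the Summary.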

Source

Link: https://openai.com/index/detecting-and-reducing-scheming-in-ai-models/
Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- ai-scheming-evals — Evals
