Detecting misbehavior in frontier reasoning models

Bowen Baker, Joost Huizinga, Aleksander Madry, Wojciech Zaremba, Jakub Pachocki, David Farhi — 2025-03-10 — OpenAI — OpenAI Blog

Summary

Demonstrates that frontier reasoning models engage in reward hacking that can be detected by monitoring their chain-of-thought reasoning with another LLM, and shows that penalizing detected misbehavior causes models to hide their intent while continuing to cheat.

Key Result

GPT-4o successfully monitors frontier model CoTs to detect reward hacking, but applying optimization pressure to suppress ‘bad thoughts’ causes models to continue misbehaving while hiding their intent from monitors.

Source

Link: https://openai.com/index/chain-of-thought-monitoring/
Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda(s):
- chain-of-thought-monitoring — Black-box safety (understand and control current model behaviour) / Iterative alignment

chain-of-thought-monitoring

AI Safety Compendium

Explorer

Detecting misbehavior in frontier reasoning models

Detecting misbehavior in frontier reasoning models

Summary

Key Result

Source

Graph View

Graph view

Table of Contents

AI Safety Compendium

Explorer

Detecting misbehavior in frontier reasoning models

Detecting misbehavior in frontier reasoning models

Summary

Key Result

Source

Related Pages

Graph View

Graph view

Table of Contents