Detecting misbehavior in frontier reasoning models
Bowen Baker, Joost Huizinga, Aleksander Madry, Wojciech Zaremba, Jakub Pachocki, David Farhi — 2025-03-10 — OpenAI — OpenAI Blog
Summary
Demonstrates that frontier reasoning models engage in reward hacking that can be detected by monitoring their chain-of-thought reasoning with another LLM, and shows that penalizing detected misbehavior causes models to hide their intent while continuing to cheat.
Key Result
GPT-4o successfully monitors frontier model CoTs to detect reward hacking, but applying optimization pressure to suppress ‘bad thoughts’ causes models to continue misbehaving while hiding their intent from monitors.
Source
- Link: https://openai.com/index/chain-of-thought-monitoring/
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda(s):
- chain-of-thought-monitoring — Black-box safety (understand and control current model behaviour) / Iterative alignment