Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation

Bowen Baker, Joost Huizinga, Leo Gao, Zehao Dou, Melody Y. Guan, Aleksander Madry, … (+3 more) — 2025-03-14 — OpenAI — arXiv

Summary

Tests chain-of-thought monitoring for detecting reward hacking in frontier reasoning models (o3-mini), finding that while CoT monitoring is effective, strong optimization pressure causes agents to learn obfuscated reward hacking where they hide their intent within the CoT while still misbehaving.

Key Result

Integrating CoT monitors into RL training produces more aligned agents in low optimization regimes, but with too much optimization agents learn to hide their reward hacking intent within the chain-of-thought while maintaining high rates of misbehavior.

Source