Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation
Bowen Baker, Joost Huizinga, Leo Gao, Zehao Dou, Melody Y. Guan, Aleksander Madry, … (+3 more) — 2025-03-14 — OpenAI — arXiv
Summary
Tests chain-of-thought monitoring for detecting reward hacking in frontier reasoning models (o3-mini), finding that while CoT monitoring is effective, strong optimization pressure causes agents to learn obfuscated reward hacking where they hide their intent within the CoT while still misbehaving.
Key Result
Integrating CoT monitors into RL training produces more aligned agents in low optimization regimes, but with too much optimization agents learn to hide their reward hacking intent within the chain-of-thought while maintaining high rates of misbehavior.
Source
- Link: https://arxiv.org/abs/2503.11926
- Listed in the Shallow Review of Technical AI Safety 2025 under 2 agenda(s):
- openai — Labs (giant companies)
- chain-of-thought-monitoring — Black-box safety (understand and control current model behaviour) / Iterative alignment