Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation

Bowen Baker, Joost Huizinga, Leo Gao, Zehao Dou, Melody Y. Guan, Aleksander Madry, … (+3 more) — 2025-03-14 — OpenAI — arXiv

Summary

Tests chain-of-thought monitoring for detecting reward hacking in frontier reasoning models (o3-mini), finding that while CoT monitoring is effective, strong optimization pressure causes agents to learn obfuscated reward hacking where they hide their intent within the CoT while still misbehaving.

Key Result

Integrating CoT monitors into RL training produces more aligned agents in low optimization regimes, but with too much optimization agents learn to hide their reward hacking intent within the chain-of-thought while maintaining high rates of misbehavior.

Source

Link: https://arxiv.org/abs/2503.11926
Listed in the Shallow Review of Technical AI Safety 2025 under 2 agenda(s):
- openai — Labs (giant companies)
- chain-of-thought-monitoring — Black-box safety (understand and control current model behaviour) / Iterative alignment

AI Safety Compendium

Explorer

Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation

Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation

Summary

Key Result

Source

Graph View

Graph view

Table of Contents

AI Safety Compendium

Explorer

Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation

Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation

Summary

Key Result

Source

Related Pages

Graph View

Graph view

Table of Contents