When Chain of Thought is Necessary, Language Models Struggle to Evade Monitors

Scott Emmons, Erik Jenner, David K. Elson, Rif A. Saurous, Senthooran Rajamanoharan, Heng Chen, … (+2 more) — 2025-07-07 — Google DeepMind — arXiv

Summary

Introduces a conceptual framework distinguishing CoT-as-rationalization (post-hoc justification of a decision already made) from CoT-as-computation (reasoning the model actually needs to complete the task), then empirically tests whether language models can evade chain-of-thought monitoring when complex reasoning is necessary to carry out harmful behavior.

Key Result

Models can learn to obscure their intentions from CoT monitors, but only with substantial assistance, such as detailed human-written evasion strategies or iterative optimization against the monitor.

Source