When Chain of Thought is Necessary, Language Models Struggle to Evade Monitors
Scott Emmons, Erik Jenner, David K. Elson, Rif A. Saurous, Senthooran Rajamanoharan, Heng Chen, … (+2 more) — 2025-07-07 — Google DeepMind — arXiv
Summary
Introduces a conceptual framework distinguishing CoT-as-rationalization from CoT-as-computation, then empirically tests whether language models can evade chain-of-thought monitoring when complex reasoning is necessary for harmful behavior.
Key Result
Models can learn to obscure their intentions from CoT monitors, but only when given significant help such as detailed human-written strategies or iterative optimization against the monitor.
Source
- Link: https://arxiv.org/abs/2507.05246
- Listed in the Shallow Review of Technical AI Safety 2025 under 2 agenda(s):
- google-deepmind — Labs (giant companies)
- chain-of-thought-monitoring — Black-box safety (understand and control current model behaviour) / Iterative alignment