When Chain of Thought is Necessary, Language Models Struggle to Evade Monitors

Scott Emmons, Erik Jenner, David K. Elson, Rif A. Saurous, Senthooran Rajamanoharan, Heng Chen, … (+2 more) — 2025-07-07 — Google DeepMind — arXiv

Summary

Introduces a conceptual framework distinguishing CoT-as-rationalization from CoT-as-computation, then empirically tests whether language models can evade chain-of-thought monitoring when complex reasoning is necessary for harmful behavior.

Key Result

Models can learn to obscure their intentions from CoT monitors, but only when given significant help such as detailed human-written strategies or iterative optimization against the monitor.

Source

Link: https://arxiv.org/abs/2507.05246
Listed in the Shallow Review of Technical AI Safety 2025 under 2 agenda(s):
- google-deepmind — Labs (giant companies)
- chain-of-thought-monitoring — Black-box safety (understand and control current model behaviour) / Iterative alignment

google-deepmind
chain-of-thought-monitoring

AI Safety Compendium

Explorer

When Chain of Thought is Necessary, Language Models Struggle to Evade Monitors

When Chain of Thought is Necessary, Language Models Struggle to Evade Monitors

Summary

Key Result

Source

Graph View

Graph view

Table of Contents

AI Safety Compendium

Explorer

When Chain of Thought is Necessary, Language Models Struggle to Evade Monitors

When Chain of Thought is Necessary, Language Models Struggle to Evade Monitors

Summary

Key Result

Source

Related Pages

Graph View

Graph view

Table of Contents