CoT Red-Handed: Stress Testing Chain-of-Thought Monitoring

Benjamin Arnav, Pablo Bernabeu-Pérez, Nathan Helm-Burger, Tim Kostolansky, Hannes Whittingham, Mary Phuong — 2025-05-29 — DeepMind — arXiv

Summary

Empirically compares Chain-of-Thought (CoT) monitoring with action-only monitoring in a red-teaming setup where untrusted models pursue harmful side tasks. CoT monitoring improves detection by up to 27 percentage points, but misleading rationalizations in the reasoning can deceive the monitor. This weakness motivates a hybrid monitoring protocol.

Key Result

The hybrid monitoring protocol, which combines scores for the reasoning and for the final output, achieves detection rates over four times higher than action-only monitoring in subtle deception scenarios. It consistently outperforms both CoT-only and action-only monitoring.
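
As a rough illustration of how two monitor scores might be combined, the sketch below takes a weighted average of a reasoning score and a final-output score and flags the sample when the result crosses a threshold. The summary only says the scores are combined. The weighted-average rule, the 0-to-1 score scale, the names `MonitorScores`, `hybrid_score`, and `is_flagged`, and the default weight and threshold are all assumptions made for this example, not the paper's reported configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MonitorScores:
    """Hypothetical suspicion scores in [0, 1] from two separate monitors."""
    cot_score: float     # assigned by a monitor that reads the reasoning
    action_score: float  # assigned by a monitor that reads only the final output


def hybrid_score(scores: MonitorScores, cot_weight: float = 0.5) -> float:
    """Combine both scores with a weighted average (assumed combination rule)."""
    if not 0.0 <= cot_weight <= 1.0:
        raise ValueError("cot_weight must be between 0 and 1")
    return cot_weight * scores.cot_score + (1.0 - cot_weight) * scores.action_score


def is_flagged(scores: MonitorScores, threshold: float = 0.5,
               cot_weight: float = 0.5) -> bool:
    """Flag the sample when the combined score reaches the (assumed) threshold."""
    return hybrid_score(scores, cot_weight) >= threshold


if __name__ == "__main__":
    # A persuasive rationalization lowers the reasoning score,
    # while the final output still looks suspicious to the action monitor.
    sample = MonitorScores(cot_score=0.2, action_score=0.9)
    print(hybrid_score(sample), is_flagged(sample))
```

Under these assumptions, a low reasoning score caused by a misleading rationalization does not suppress the alert by itself, because the output score still contributes to the combined value. In practice the weight and threshold would need to be tuned on held-out data.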

Source