CoT Red-Handed: Stress Testing Chain-of-Thought Monitoring
Benjamin Arnav, Pablo Bernabeu-Pérez, Nathan Helm-Burger, Tim Kostolansky, Hannes Whittingham, Mary Phuong — 2025-05-29 — DeepMind — arXiv
Summary
Empirically compares Chain-of-Thought (CoT) monitoring with action-only monitoring in a red-teaming setup where untrusted models pursue harmful side tasks. CoT monitoring improves detection by up to 27 percentage points, but misleading rationalizations in the reasoning can deceive the monitor, which motivates a hybrid monitoring protocol.
Key Result
The hybrid monitoring protocol, which combines scores for the reasoning and the final output, achieves detection rates over four times higher than action-only monitoring in subtle deception scenarios and consistently outperforms both CoT-only and action-only monitoring.
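As a rough illustration of combining a reasoning score with a final-output score, the sketch below takes a weighted average of two monitor suspicion scores and flags a sample when the result crosses a threshold. The function names, the [0, 1] score range, the default weight, and the threshold are all illustrative assumptions, not values or code from the paper.

```python
def hybrid_score(cot_score: float, action_score: float, weight: float = 0.5) -> float:
    """Combine two suspicion scores with a weighted average.

    cot_score:    hypothetical monitor score for the reasoning trace, assumed in [0, 1]
    action_score: hypothetical monitor score for the final output, assumed in [0, 1]
    weight:       share given to the reasoning score (illustrative default)
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be in [0, 1]")
    return weight * cot_score + (1.0 - weight) * action_score


def is_flagged(cot_score: float, action_score: float,
               weight: float = 0.5, threshold: float = 0.5) -> bool:
    """Flag a sample when the combined score reaches the (assumed) threshold."""
    return hybrid_score(cot_score, action_score, weight) >= threshold


if __name__ == "__main__":
    # A persuasive rationalization lowers the reasoning score, but the
    # final-output score still contributes to the combined result.
    print(is_flagged(cot_score=0.0, action_score=1.0, weight=0.5, threshold=0.5))
```

Scoring the two channels separately means a misleading rationale can only pull down the reasoning term; how much that matters depends on the chosen weight, which in practice would need to be tuned on held-out data.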
Source
- Link: https://arxiv.org/abs/2505.23575
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
  - chain-of-thought-monitoring — Black-box safety (understand and control current model behaviour) / Iterative alignment