CoT Red-Handed: Stress Testing Chain-of-Thought Monitoring

Benjamin Arnav, Pablo Bernabeu-Pérez, Nathan Helm-Burger, Tim Kostolansky, Hannes Whittingham, Mary Phuong — 2025-05-29 — DeepMind — arXiv

Summary

Empirically compares Chain-of-Thought (CoT) monitoring with action-only monitoring in a red-teaming setup where untrusted models pursue harmful side tasks. CoT monitoring improves detection by up to 27 percentage points, but misleading rationalizations in the reasoning trace can deceive the monitor, which motivates a hybrid monitoring protocol.

Key Result

The hybrid monitoring protocol, which combines scores for the reasoning and the final output, achieves detection rates over four times higher than action-only monitoring in subtle deception scenarios and consistently outperforms both CoT-only and action-only monitoring.
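
As a rough illustration of how a reasoning score and a final-output score might be combined, the sketch below uses a weighted average followed by a threshold. This summary does not state the combination rule, so the weighted average, the [0, 1] score scale, and the names `MonitorScores`, `hybrid_score`, `is_flagged`, `cot_weight`, and `threshold` are all assumptions made for illustration, not the authors' implementation.

```python
from dataclasses import dataclass


@dataclass
class MonitorScores:
    """Suspicion scores from two independent monitor passes (assumed scale: 0 to 1)."""
    cot: float     # score given after reading the reasoning trace
    action: float  # score given after reading only the final output


def hybrid_score(scores: MonitorScores, cot_weight: float = 0.5) -> float:
    """Combine both scores with a weighted average (an assumed combination rule)."""
    if not 0.0 <= cot_weight <= 1.0:
        raise ValueError("cot_weight must be between 0 and 1")
    return cot_weight * scores.cot + (1.0 - cot_weight) * scores.action


def is_flagged(scores: MonitorScores, threshold: float = 0.5,
               cot_weight: float = 0.5) -> bool:
    """Flag a sample when the combined suspicion reaches the (hypothetical) threshold."""
    return hybrid_score(scores, cot_weight) >= threshold


# A trace whose reasoning rationalizes the side task can look benign to the
# CoT pass while the final output still looks suspicious to the action pass.
rationalized = MonitorScores(cot=0.2, action=0.9)

# A subtle side task can look benign in the output but be visible in the reasoning.
subtle = MonitorScores(cot=0.8, action=0.3)

for name, sample in [("rationalized", rationalized), ("subtle", subtle)]:
    print(name, round(hybrid_score(sample), 2), is_flagged(sample))
```

With equal weights, neither pass alone decides the outcome: a low score from one monitor can be offset by a high score from the other, which is the intuition behind scoring the reasoning and the output separately. In practice the weight and threshold would be tuned on held-out data.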

Source

Link: https://arxiv.org/abs/2505.23575
Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- chain-of-thought-monitoring — Black-box safety (understand and control current model behaviour) / Iterative alignment
