Can Reasoning Models Obfuscate Reasoning? Stress-Testing Chain-of-Thought Monitorability
Artur Zolkowski, Wen Xing, David Lindner, Florian Tramèr, Erik Jenner — 2025-10-21 — ETH Zurich, FAR AI — arXiv
Summary
Develops a taxonomy of prompts that apply increasing obfuscation pressure, testing whether models can pursue adversarial objectives while hiding the incriminating reasoning from a CoT monitor; evaluates both internal and external CoT on toy tasks and SHADE-Arena environments.
Key Result
Under strong obfuscation pressure, some models successfully complete adversarial tasks while evading CoT-based detection, though models obfuscate external CoT more than internal CoT.
Source
- Link: https://arxiv.org/abs/2510.19851
- Listed in the Shallow Review of Technical AI Safety 2025 under two agendas:
  - control — Black-box safety (understand and control current model behaviour) / Iterative alignment
  - chain-of-thought-monitoring — Black-box safety (understand and control current model behaviour) / Iterative alignment