Can Reasoning Models Obfuscate Reasoning? Stress-Testing Chain-of-Thought Monitorability
Artur Zolkowski, Wen Xing, David Lindner, Florian Tramèr, Erik Jenner — 2025-10-21 — ETH Zurich, FAR AI — arXiv
Summary
Develops a taxonomy of prompts that apply increasing obfuscation pressure, testing whether models can pursue adversarial objectives while hiding the incriminating reasoning from a CoT monitor; evaluates both internal and external CoT on toy tasks and SHADE-Arena environments.
Key Result
Under strong obfuscation pressure, some models successfully complete adversarial tasks while evading CoT-based detection, though models obfuscate external CoT more than internal CoT.
Source
- Link: https://arxiv.org/abs/2510.19851
- Listed in the Shallow Review of Technical AI Safety 2025 under two agendas:
  - control — Black-box safety (understand and control current model behaviour) / Iterative alignment
  - chain-of-thought-monitoring — Black-box safety (understand and control current model behaviour) / Iterative alignment