Chain-of-Thought Hijacking

Jianli Zhao, Tingchen Fu, Rylan Schaeffer, Mrinank Sharma, Fazl Barez — 2025-10-30 — arXiv

Summary

Introduces a novel jailbreak attack on large reasoning models that pads harmful requests with long sequences of harmless puzzle reasoning, achieving 94-100% attack success rates across frontier models. Includes a mechanistic analysis showing that the benign chain-of-thought dilutes safety-checking signals by shifting attention away from harmful tokens.
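The padding idea can be sketched as simple prompt construction: prepend many copies of benign step-by-step reasoning so the harmful request sits at the tail of a long, mostly harmless context. This is a hypothetical illustration, not the paper's actual implementation; the function name, puzzle text, and framing sentence are all assumptions.

```python
def build_hijacked_prompt(harmful_request: str, puzzle_reasoning: str, n_repeats: int) -> str:
    """Hypothetical sketch of Chain-of-Thought Hijacking prompt construction:
    pad the context with repeated benign reasoning, then append the target
    request at the end. Names and wording are illustrative, not from the paper."""
    padding = "\n\n".join([puzzle_reasoning] * n_repeats)
    return padding + "\n\nOne final task: " + harmful_request

# Illustrative benign filler (invented for this sketch).
puzzle = (
    "Puzzle: a train leaves at 3pm traveling 60 km/h. "
    "Step 1: compute the distance after 2 hours. "
    "Step 2: 60 * 2 = 120 km. So the answer is 120 km."
)

prompt = build_hijacked_prompt("<harmful request here>", puzzle, n_repeats=50)
```

The intuition the paper's analysis supports is that as `n_repeats` grows, attention mass concentrates on the benign reasoning, weakening the model's refusal signal on the final request.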

Key Result

Chain-of-Thought Hijacking achieves 99%, 94%, 100%, and 94% attack success rates on Gemini 2.5 Pro, GPT o4-mini, Grok 3 mini, and Claude 4 Sonnet respectively, far exceeding prior jailbreak methods for reasoning models.

Source