Chain-of-Thought Hijacking
Jianli Zhao, Tingchen Fu, Rylan Schaeffer, Mrinank Sharma, Fazl Barez — 2025-10-30 — arXiv
Summary
Introduces a novel jailbreak attack on large reasoning models that pads harmful requests with long sequences of harmless puzzle reasoning, achieving 94-100% attack success rates across frontier models. Includes a mechanistic analysis showing that the benign chain-of-thought dilutes safety-checking signals by shifting attention away from the harmful tokens.
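The attention-dilution claim can be probed on an open-weight model with a short measurement script. The sketch below is not from the paper: it assumes a HuggingFace causal LM (Qwen/Qwen2.5-1.5B-Instruct is used purely as a placeholder), a generic puzzle-style padding string, and a simple proxy metric (the fraction of final-position attention mass landing on the request tokens, averaged over layers and heads); the paper's own analysis targets frontier reasoning models with its own metrics.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical model choice; any open-weight chat model that can return attentions works.
MODEL = "Qwen/Qwen2.5-1.5B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(MODEL)
# "eager" attention so that per-head attention maps are materialized.
model = AutoModelForCausalLM.from_pretrained(MODEL, attn_implementation="eager")
model.eval()


def attention_mass_on_span(prefix: str, span: str) -> float:
    """Fraction of the final position's attention (averaged over layers and heads)
    that falls on the tokens of `span` when it is preceded by `prefix`."""
    prefix_ids = tokenizer(prefix, return_tensors="pt").input_ids
    span_ids = tokenizer(span, add_special_tokens=False, return_tensors="pt").input_ids
    input_ids = torch.cat([prefix_ids, span_ids], dim=1)

    with torch.no_grad():
        out = model(input_ids, output_attentions=True)

    # out.attentions: one (batch, heads, seq, seq) tensor per layer.
    # Take the attention distribution of the last query position in every layer/head.
    last_query = torch.stack([a[0, :, -1, :] for a in out.attentions])  # (layers, heads, seq)
    span_start = prefix_ids.shape[1]
    # Each attention row sums to 1, so the slice sum is the mass on the request tokens.
    return last_query[:, :, span_start:].sum(dim=-1).mean().item()


# Placeholder request; substitute the prompt under study.
request = "<request under study>"
for n_blocks in (0, 1, 4, 8):
    # Benign puzzle-style reasoning used as padding, growing with n_blocks.
    padding = "Step-by-step reasoning through a Sudoku puzzle. " * (40 * n_blocks)
    print(n_blocks, attention_mass_on_span(padding, request))
```

If the hijacking mechanism described above holds, the printed fraction should shrink as the benign padding grows, reflecting attention drifting away from the request tokens.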
Key Result
Chain-of-Thought Hijacking achieves 99%, 94%, 100%, and 94% attack success rates on Gemini 2.5 Pro, GPT o4 mini, Grok 3 mini, and Claude 4 Sonnet respectively, far exceeding prior jailbreak methods for reasoning models.
Source
- Link: https://arxiv.org/abs/2510.26418
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
  - various-redteams — Evals