Chain-of-Thought Hijacking
Jianli Zhao, Tingchen Fu, Rylan Schaeffer, Mrinank Sharma, Fazl Barez — 2025-10-30 — arXiv
Summary
Introduces a novel jailbreak attack on large reasoning models that pads harmful requests with long sequences of harmless puzzle reasoning, achieving 94-100% attack success rates across frontier models. Includes a mechanistic analysis showing that the benign chain-of-thought dilutes safety-checking signals by shifting attention away from the harmful tokens.
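The attention-dilution claim can be probed on an open-weight model with a short measurement script. The sketch below is not from the paper: it assumes a HuggingFace causal LM (Qwen/Qwen2.5-1.5B-Instruct is used purely as a placeholder), a generic puzzle-style padding string, and a simple proxy metric (the fraction of final-position attention mass landing on the request tokens, averaged over layers and heads); the paper's own analysis targets frontier reasoning models with its own metrics.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical model choice; any open-weight chat model that can return attentions works.
MODEL = "Qwen/Qwen2.5-1.5B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(MODEL)
# "eager" attention so that per-head attention maps are materialized.
model = AutoModelForCausalLM.from_pretrained(MODEL, attn_implementation="eager")
model.eval()


def attention_mass_on_span(prefix: str, span: str) -> float:
    """Fraction of the final position's attention (averaged over layers and heads)
    that falls on the tokens of `span` when it is preceded by `prefix`."""
    prefix_ids = tokenizer(prefix, return_tensors="pt").input_ids
    span_ids = tokenizer(span, add_special_tokens=False, return_tensors="pt").input_ids
    input_ids = torch.cat([prefix_ids, span_ids], dim=1)

    with torch.no_grad():
        out = model(input_ids, output_attentions=True)

    # out.attentions: one (batch, heads, seq, seq) tensor per layer.
    # Take the attention distribution of the last query position in every layer/head.
    last_query = torch.stack([a[0, :, -1, :] for a in out.attentions])  # (layers, heads, seq)
    span_start = prefix_ids.shape[1]
    # Each attention row sums to 1, so the slice sum is the mass on the request tokens.
    return last_query[:, :, span_start:].sum(dim=-1).mean().item()


# Placeholder request; substitute the prompt under study.
request = "<request under study>"
for n_blocks in (0, 1, 4, 8):
    # Benign puzzle-style reasoning used as padding, growing with n_blocks.
    padding = "Step-by-step reasoning through a Sudoku puzzle. " * (40 * n_blocks)
    print(n_blocks, attention_mass_on_span(padding, request))
```

If the hijacking mechanism described above holds, the printed fraction should shrink as the benign padding grows, reflecting attention drifting away from the request tokens.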
Key Result
Chain-of-Thought Hijacking achieves 99%, 94%, 100%, and 94% attack success rates on Gemini 2.5 Pro, GPT o4 mini, Grok 3 mini, and Claude 4 Sonnet respectively, far exceeding prior jailbreak methods for reasoning models.
Source
- Link: https://arxiv.org/abs/2510.26418
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
  - various-redteams — Evals