Adversarial Manipulation of Reasoning Models using Internal Representations

Kureha Yamaguchi, Benjamin Etheridge, Andy Arditi — 2025-07-03 — arXiv

Summary

Demonstrates that reasoning models such as DeepSeek-R1-Distill-Llama-8B can be jailbroken by ablating a ‘caution’ direction in activation space during chain-of-thought generation. Intervening only on the CoT token activations suffices to control the model's final outputs.

Key Result

Ablating the identified ‘caution’ direction from model activations during chain-of-thought generation increases harmful compliance, effectively jailbreaking the model even though the intervention is never applied at the final-answer token positions.
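
Below is a minimal sketch of directional ablation via forward hooks, assuming a `caution_dir` vector has already been extracted (e.g., via a difference of means between activations on harmful and harmless prompts; the paper's exact extraction procedure may differ). For brevity the hook here ablates at every token position, whereas the paper restricts the intervention to chain-of-thought tokens; the model name and hook placement are illustrative assumptions, not the authors' released code.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def make_ablation_hook(direction: torch.Tensor):
    """Return a forward hook that projects `direction` out of the residual stream."""
    direction = direction / direction.norm()

    def hook(module, inputs, output):
        # Llama decoder layers return a tuple; hidden states are the first element.
        hidden = output[0] if isinstance(output, tuple) else output
        # Remove the component along `direction`: h <- h - (h . d) d
        proj = (hidden @ direction).unsqueeze(-1) * direction
        hidden = hidden - proj
        if isinstance(output, tuple):
            return (hidden,) + output[1:]
        return hidden

    return hook

model_name = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Placeholder: substitute the extracted 'caution' direction here.
caution_dir = torch.randn(model.config.hidden_size, dtype=model.dtype)

handles = [
    layer.register_forward_hook(make_ablation_hook(caution_dir))
    for layer in model.model.layers
]
# ... run model.generate(...) while the hooks are active ...
for h in handles:
    h.remove()
```

To match the paper's setup, the hook would additionally need to be gated by token position so that ablation applies only while CoT tokens are being generated.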

Source