Adversarial Manipulation of Reasoning Models using Internal Representations
Kureha Yamaguchi, Benjamin Etheridge, Andy Arditi — 2025-07-03 — arXiv
Summary
Demonstrates that reasoning models like DeepSeek-R1-Distill-Llama-8B can be jailbroken by ablating a ‘caution’ direction in activation space during chain-of-thought generation, showing that intervening only on CoT token activations suffices to control final outputs.
Key Result
Ablating the identified ‘caution’ direction from model activations during chain-of-thought generation increases harmful compliance, effectively jailbreaking the model even though the intervention is applied only to CoT-token activations and not during final-answer generation.
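The intervention described above is a directional ablation of the residual stream. Below is a minimal sketch, assuming a precomputed unit-norm ‘caution’ direction and the HuggingFace checkpoint of the distilled model; the file name `caution_direction.pt`, the hook placement on every decoder layer, and the simple two-stage generate call are illustrative assumptions, not the authors' exact pipeline (in practice the CoT/answer boundary would be detected, e.g. at the `</think>` token).

```python
# Sketch: ablate a 'caution' direction from the residual stream, but only
# while the chain of thought is being generated. Assumes `caution_direction.pt`
# holds a (hidden_size,) vector extracted beforehand (hypothetical file name).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

caution_dir = torch.load("caution_direction.pt")
caution_dir = caution_dir / caution_dir.norm()  # ensure unit norm

def ablate_hook(module, inputs, output):
    # Project out the caution component: x <- x - (x . d) d
    hidden = output[0] if isinstance(output, tuple) else output
    d = caution_dir.to(device=hidden.device, dtype=hidden.dtype)
    hidden = hidden - (hidden @ d).unsqueeze(-1) * d
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

# Register the ablation on each transformer block's output (residual stream).
handles = [layer.register_forward_hook(ablate_hook) for layer in model.model.layers]

prompt_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": "..."}],  # harmful request elided
    add_generation_prompt=True,
    return_tensors="pt",
)

# 1) Generate the chain of thought with the intervention active.
cot_ids = model.generate(prompt_ids, max_new_tokens=512)

# 2) Remove the hooks, then generate the final answer with unmodified activations.
for h in handles:
    h.remove()
answer_ids = model.generate(cot_ids, max_new_tokens=256)
print(tokenizer.decode(answer_ids[0], skip_special_tokens=True))
```

The two-stage call mirrors the key result: even though stage 2 runs without any intervention, a CoT generated under ablation is enough to steer the final answer toward compliance.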
Source
- Link: https://arxiv.org/abs/2507.03167
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- various-redteams — Evals