Adversarial Manipulation of Reasoning Models using Internal Representations
Kureha Yamaguchi, Benjamin Etheridge, Andy Arditi — 2025-07-03 — arXiv
Summary
Demonstrates that reasoning models like DeepSeek-R1-Distill-Llama-8B can be jailbroken by ablating a ‘caution’ direction in activation space during chain-of-thought generation, showing that intervening only on CoT token activations suffices to control final outputs.
Key Result
Ablating the identified ‘caution’ direction from model activations during chain-of-thought generation increases harmful compliance, effectively jailbreaking the model even though the intervention is applied only to CoT-token activations and not during final-answer generation.
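The intervention described above is a directional ablation of the residual stream. Below is a minimal sketch, assuming a precomputed unit-norm ‘caution’ direction and the HuggingFace checkpoint of the distilled model; the file name `caution_direction.pt`, the hook placement on every decoder layer, and the simple two-stage generate call are illustrative assumptions, not the authors' exact pipeline (in practice the CoT/answer boundary would be detected, e.g. at the `</think>` token).

```python
# Sketch: ablate a 'caution' direction from the residual stream, but only
# while the chain of thought is being generated. Assumes `caution_direction.pt`
# holds a (hidden_size,) vector extracted beforehand (hypothetical file name).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

caution_dir = torch.load("caution_direction.pt")
caution_dir = caution_dir / caution_dir.norm()  # ensure unit norm

def ablate_hook(module, inputs, output):
    # Project out the caution component: x <- x - (x . d) d
    hidden = output[0] if isinstance(output, tuple) else output
    d = caution_dir.to(device=hidden.device, dtype=hidden.dtype)
    hidden = hidden - (hidden @ d).unsqueeze(-1) * d
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

# Register the ablation on each transformer block's output (residual stream).
handles = [layer.register_forward_hook(ablate_hook) for layer in model.model.layers]

prompt_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": "..."}],  # harmful request elided
    add_generation_prompt=True,
    return_tensors="pt",
)

# 1) Generate the chain of thought with the intervention active.
cot_ids = model.generate(prompt_ids, max_new_tokens=512)

# 2) Remove the hooks, then generate the final answer with unmodified activations.
for h in handles:
    h.remove()
answer_ids = model.generate(cot_ids, max_new_tokens=256)
print(tokenizer.decode(answer_ids[0], skip_special_tokens=True))
```

The two-stage call mirrors the key result: even though stage 2 runs without any intervention, a CoT generated under ablation is enough to steer the final answer toward compliance.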
Source
- Link: https://arxiv.org/abs/2507.03167
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- various-redteams — Evals