In-Context Representation Hijacking
Itay Yona, Amir Sarid, Michael Karasik, Yossi Gandelsman — 2025-12-03 — arXiv
Summary
Introduces Doublespeak, an optimization-free in-context attack that hijacks LLM representations: harmful keywords are systematically replaced with benign tokens throughout the context, causing the benign token's internal representation to converge toward the harmful one and bypassing safety alignment.
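A minimal sketch of the substitution step described above. This is not the authors' code; the function name, the example sentences, and the choice of surrogate token are all illustrative. It only shows how a context can be rewritten so a benign surrogate token occupies the harmful keyword's usual positions (the representation-level convergence happens inside the model, not in this snippet):

```python
def build_substitution_context(sentences, harmful_keyword, benign_token):
    """Rewrite in-context sentences so a benign surrogate token replaces
    every occurrence of the harmful keyword, placing the surrogate in the
    harmful keyword's usual lexical neighborhood."""
    return [s.replace(harmful_keyword, benign_token) for s in sentences]

# Illustrative placeholders, not from the paper:
context = build_substitution_context(
    ["A weapon has a trigger.", "Never hand a weapon to a child."],
    harmful_keyword="weapon",
    benign_token="flower",
)
# The rewritten sentences now pair "flower" with the contexts in which
# "weapon" normally appears.
```

Under the paper's claim, feeding enough such rewritten context to the model drives the surrogate token's hidden representation toward the harmful keyword's, so a final request phrased with the surrogate is processed as if it used the harmful term.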
Key Result
Achieves a 74% attack success rate on Llama-3.3-70B-Instruct with a single-sentence context override, demonstrating that current text-level alignment strategies are insufficient against representation-level attacks.
Source
- Link: https://arxiv.org/abs/2512.03771
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- various-redteams — Evals