In-Context Representation Hijacking
Itay Yona, Amir Sarid, Michael Karasik, Yossi Gandelsman — 2025-12-03 — arXiv
Summary
Introduces Doublespeak, an optimization-free in-context attack that hijacks LLM representations: harmful keywords are systematically replaced with benign tokens throughout the context, causing the benign token's internal representation to converge toward the harmful one and bypassing safety alignment.
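A minimal sketch of the substitution step described above. This is not the authors' code; the function name, the example sentences, and the choice of surrogate token are all illustrative. It only shows how a context can be rewritten so a benign surrogate token occupies the harmful keyword's usual positions (the representation-level convergence happens inside the model, not in this snippet):

```python
def build_substitution_context(sentences, harmful_keyword, benign_token):
    """Rewrite in-context sentences so a benign surrogate token replaces
    every occurrence of the harmful keyword, placing the surrogate in the
    harmful keyword's usual lexical neighborhood."""
    return [s.replace(harmful_keyword, benign_token) for s in sentences]

# Illustrative placeholders, not from the paper:
context = build_substitution_context(
    ["A weapon has a trigger.", "Never hand a weapon to a child."],
    harmful_keyword="weapon",
    benign_token="flower",
)
# The rewritten sentences now pair "flower" with the contexts in which
# "weapon" normally appears.
```

Under the paper's claim, feeding enough such rewritten context to the model drives the surrogate token's hidden representation toward the harmful keyword's, so a final request phrased with the surrogate is processed as if it used the harmful term.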
Key Result
Achieves a 74% attack success rate on Llama-3.3-70B-Instruct with a single-sentence context override, demonstrating that current text-level alignment strategies are insufficient against representation-level attacks.
Source
- Link: https://arxiv.org/abs/2512.03771
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- various-redteams — Evals