In-Context Representation Hijacking

Itay Yona, Amir Sarid, Michael Karasik, Yossi Gandelsman — 2025-12-03 — arXiv

Summary

Introduces Doublespeak, an optimization-free in-context attack that hijacks LLM representations by systematically replacing harmful keywords with benign tokens in context, causing the benign token’s internal representation to converge toward the harmful one and bypassing safety alignment.

Key Result

Achieves 74% attack success rate on Llama-3.3-70B-Instruct with single-sentence context override, demonstrating that current text-level alignment strategies are insufficient against representation-level attacks.

Source