Jailbreak Transferability Emerges from Shared Representations
Rico Angell, Jannik Brinkmann, He He — 2025-06-15 — arXiv
Summary
An empirical study of 20 models and 33 jailbreak attacks finds that jailbreak transferability emerges from shared representations rather than from incidental flaws. Two factors matter most: representational similarity between models and the strength of the jailbreak on the source model.
Key Result
Persona-style jailbreaks transfer far more often than cipher-based prompts, and deliberately increasing representational similarity through benign distillation causally increases transfer rates.
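To make "representational similarity" concrete, the sketch below computes linear CKA (centered kernel alignment) between two sets of activations collected on the same prompts. This is an illustrative assumption: the summary does not say which similarity measure the paper uses, and linear CKA is only one common choice. The function names and the toy activation values are hypothetical.

```python
from math import sqrt


def center_columns(rows):
    """Subtract each column's mean so every feature has zero mean."""
    n = len(rows)
    dims = len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(dims)]
    return [[row[j] - means[j] for j in range(dims)] for row in rows]


def cross_cov_sq_norm(a, b):
    """Squared Frobenius norm of a^T b for row-aligned matrices a and b."""
    total = 0.0
    for i in range(len(a[0])):
        for j in range(len(b[0])):
            entry = sum(ra[i] * rb[j] for ra, rb in zip(a, b))
            total += entry * entry
    return total


def linear_cka(x, y):
    """Linear CKA between activation matrices with one row per prompt.

    x and y must have the same number of rows (the same prompts, in the
    same order) but may have different feature dimensions.
    """
    if len(x) != len(y):
        raise ValueError("x and y must contain the same number of prompts")
    xc, yc = center_columns(x), center_columns(y)
    denom = sqrt(cross_cov_sq_norm(xc, xc) * cross_cov_sq_norm(yc, yc))
    if denom == 0.0:
        raise ValueError("activations are constant across prompts")
    return cross_cov_sq_norm(xc, yc) / denom


# Hypothetical toy activations: 4 prompts, 2 features for "model A".
acts_a = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [1.0, 3.0]]
# "Model B" is model A rotated by 90 degrees and scaled by 2, so it encodes
# the same geometry in a different basis.
acts_b = [[0.0, 2.0], [-2.0, 0.0], [-2.0, 4.0], [-6.0, 2.0]]
# "Model C" has a single feature that is only weakly related to model A.
acts_c = [[1.0], [-1.0], [-1.0], [1.0]]
```

Calling `linear_cka(acts_a, acts_b)` gives a score of essentially 1 because the measure ignores rotations and uniform rescaling, while `linear_cka(acts_a, acts_c)` gives a much lower score. In a study of this kind, such a score for a source and target model pair would be compared against the observed transfer rate for that pair.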
Source
- Link: https://arxiv.org/abs/2506.12913
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- various-redteams — Evals