Jailbreak Transferability Emerges from Shared Representations

Rico Angell, Jannik Brinkmann, He He — 2025-06-15 — arXiv

Summary

Empirical study across 20 models and 33 jailbreak attacks. It shows that transferability emerges from shared representations rather than incidental flaws, and identifies two key factors: the representational similarity between models and the strength of the jailbreak on the source model.
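
To make "representational similarity" concrete, here is a minimal sketch of one common way to compare two models: mutual k-nearest-neighbor overlap over the same set of prompts. This summary does not say which similarity metric the paper uses, so the choice of metric, the cosine distance, and the names `knn_indices` and `mutual_knn_similarity` are all assumptions made for illustration.

```python
import math


def knn_indices(vectors, k):
    """For each vector, return the set of indices of its k nearest
    neighbors under cosine similarity (excluding the vector itself)."""

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

    neighbors = []
    for i, u in enumerate(vectors):
        scored = [(cosine(u, v), j) for j, v in enumerate(vectors) if j != i]
        # Highest similarity first; ties broken by index for determinism.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        neighbors.append({j for _, j in scored[:k]})
    return neighbors


def mutual_knn_similarity(reps_a, reps_b, k):
    """Average fraction of shared k-nearest neighbors between two models.

    reps_a[i] and reps_b[i] are the two models' representations of the
    same prompt i. The embedding dimensions may differ between models.
    This is an illustrative metric, not necessarily the paper's.
    """
    if len(reps_a) != len(reps_b):
        raise ValueError("both models must embed the same prompts")
    if not 0 < k < len(reps_a):
        raise ValueError("k must satisfy 0 < k < number of prompts")
    nn_a = knn_indices(reps_a, k)
    nn_b = knn_indices(reps_b, k)
    shared = sum(len(a & b) for a, b in zip(nn_a, nn_b))
    return shared / (k * len(reps_a))


# Toy data (made up): four prompts embedded by two hypothetical models.
model_a = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
model_b = [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0], [0.0, 0.0, 1.0], [0.0, 0.2, 0.8]]
print(mutual_knn_similarity(model_a, model_b, k=1))
```

A score near 1 means the two models place the same prompts close together; a score near 0 means their neighborhood structure disagrees. The metric only compares neighbor sets, so it needs no alignment between the two embedding spaces.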

Key Result

Persona-style jailbreaks transfer far more often than cipher-based prompts, and deliberately increasing representational similarity through benign distillation causally increases transfer rates.
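
As a rough illustration of what a "transfer rate" can mean, the sketch below counts how often a jailbreak that succeeds on a source model also succeeds on a target model. The conditional definition used here and the function name `transfer_rate` are assumptions for illustration; the paper may define and normalize the quantity differently. The boolean lists are made-up toy data, not results from the study.

```python
def transfer_rate(source_success, target_success):
    """Fraction of source-successful attacks that also succeed on the target.

    source_success[i] and target_success[i] are booleans for the same
    attack prompt i. Returns None when no attack succeeded on the source,
    since the rate is undefined in that case. (Assumed definition.)
    """
    if len(source_success) != len(target_success):
        raise ValueError("outcome lists must cover the same prompts")
    # Keep only target outcomes for prompts that worked on the source.
    outcomes = [t for s, t in zip(source_success, target_success) if s]
    if not outcomes:
        return None
    return sum(outcomes) / len(outcomes)


# Toy outcomes (made up) for four attack prompts.
source = [True, True, False, True]
target = [True, False, True, True]
print(transfer_rate(source, target))  # 2 of the 3 source successes transfer
```

Conditioning on source success separates transfer from raw attack strength: a prompt that fails everywhere says nothing about whether the two models share a weakness.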

Source