Jailbreak Transferability Emerges from Shared Representations

Rico Angell, Jannik Brinkmann, He He — 2025-06-15 — arXiv

Summary

An empirical study across 20 models and 33 jailbreak attacks shows that transferability emerges from shared representations rather than from incidental flaws. The two key factors are the representational similarity between the source and target models and the strength of the jailbreak on the source model.

Key Result

Persona-style jailbreaks transfer far more often than cipher-based prompts, and deliberately increasing representational similarity through benign distillation causally increases transfer rates.
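To make "representational similarity" concrete, here is a minimal sketch of one common way to score it: linear centered kernel alignment (CKA) between two models' activations on the same prompts. This summary does not say which similarity measure the paper uses, so linear CKA is swapped in purely as an illustration. The function names and the toy matrices are hypothetical. Each matrix has one row per prompt and one column per hidden dimension, and the two models may have different widths.

```python
import math


def center_columns(m):
    """Subtract each column's mean so every feature has zero mean over prompts."""
    if not m or not m[0]:
        raise ValueError("activation matrix must be non-empty")
    n, d = len(m), len(m[0])
    means = [sum(row[j] for row in m) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in m]


def cross_gram_sq_norm(a, b):
    """Squared Frobenius norm of a^T b, for a (n x p) and b (n x q)."""
    p, q = len(a[0]), len(b[0])
    total = 0.0
    for i in range(p):
        for j in range(q):
            s = sum(ra[i] * rb[j] for ra, rb in zip(a, b))
            total += s * s
    return total


def linear_cka(x, y):
    """Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F) on centered data.

    x and y hold activations for the same prompts, in the same row order.
    Returns 0.0 if either matrix has no variance (an illustrative convention).
    """
    if len(x) != len(y):
        raise ValueError("both models must be evaluated on the same prompts")
    xc, yc = center_columns(x), center_columns(y)
    numerator = cross_gram_sq_norm(xc, yc)
    denominator = math.sqrt(
        cross_gram_sq_norm(xc, xc) * cross_gram_sq_norm(yc, yc)
    )
    return numerator / denominator if denominator > 0 else 0.0


# Hypothetical toy activations: 4 prompts, 2 hidden dimensions.
acts_a = [[1.0, 2.0], [3.0, 4.0], [5.0, 7.0], [0.0, 1.0]]
# Same geometry as acts_a, rotated by 90 degrees in feature space.
acts_b = [[-row[1], row[0]] for row in acts_a]
score = linear_cka(acts_a, acts_b)
```

Linear CKA is invariant to rotations and uniform rescaling of the feature space, so two models that encode prompts with the same geometry score close to 1 even when their individual neurons do not line up. Under the paper's thesis, one would expect higher scores of this kind to go along with higher transfer rates, though the metric and thresholds used in the paper may differ from this sketch.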

Source

Link: https://arxiv.org/abs/2506.12913
Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- various-redteams — Evals
