Toward Understanding the Transferability of Adversarial Suffixes in Large Language Models

Sarah Ball, Niki Hasrati, Alexander Robey, Avi Schwarzschild, Frauke Kreuter, Zico Kolter, … (+1 more) — 2025-10-24 — arXiv

Summary

An empirical investigation of why adversarial suffixes used in jailbreaking attacks transfer across prompts and models. The paper identifies three statistical properties, defined relative to the model's internal refusal direction, that predict transfer success, and uses interventional experiments to demonstrate practical applications of these findings.

Key Result

Transfer success correlates strongly with three properties: how much a prompt activates the refusal direction, how strongly the suffix pushes activations away from that direction, and the magnitude of the suffix-induced shift in directions orthogonal to it. Semantic similarity, by contrast, is only a weak predictor of transfer.
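To make the three properties concrete, here is a minimal sketch of how they could be computed for a single prompt and suffix. It is an illustration under stated assumptions, not the paper's implementation: the function name `refusal_metrics`, the use of one hidden-state vector per input, and the sign convention for the push are all hypothetical, and the summary above does not say which layer or token position the activations come from.

```python
import math

def refusal_metrics(h_prompt, h_with_suffix, refusal_dir):
    """Return (prompt_activation, suffix_push, orthogonal_shift).

    h_prompt:      hidden state for the prompt alone (assumed: one vector)
    h_with_suffix: hidden state for the prompt plus adversarial suffix
    refusal_dir:   a vector for the refusal direction (assumed to be given)
    """
    # Normalise the refusal direction to unit length.
    norm = math.sqrt(sum(x * x for x in refusal_dir))
    r = [x / norm for x in refusal_dir]

    # Property 1: how much the prompt activates the refusal direction.
    prompt_activation = sum(h * ri for h, ri in zip(h_prompt, r))

    # Change in the hidden state caused by appending the suffix.
    delta = [b - a for a, b in zip(h_prompt, h_with_suffix)]
    along = sum(d * ri for d, ri in zip(delta, r))

    # Property 2: push away from refusal (positive means "away", by assumption).
    suffix_push = -along

    # Property 3: size of the shift in directions orthogonal to refusal.
    orthogonal = [d - along * ri for d, ri in zip(delta, r)]
    orthogonal_shift = math.sqrt(sum(x * x for x in orthogonal))

    return prompt_activation, suffix_push, orthogonal_shift

# Toy vectors chosen for illustration only; they are not data from the paper.
print(refusal_metrics([3.0, 0.0, 0.0], [1.0, 2.0, 0.0], [2.0, 0.0, 0.0]))
```

In this toy case the prompt projects onto the refusal direction with value 3, the suffix removes 2 units of that projection, and it also moves the state 2 units in an orthogonal direction. Separating the shift into a component along the refusal direction and an orthogonal remainder is what allows the second and third properties to be measured independently.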

Source