Toward Understanding the Transferability of Adversarial Suffixes in Large Language Models
Sarah Ball, Niki Hasrati, Alexander Robey, Avi Schwarzschild, Frauke Kreuter, Zico Kolter, … (+1 more) — 2025-10-24 — arXiv
Summary
An empirical investigation of why adversarial suffixes used in jailbreaking attacks transfer across prompts and models. It identifies three statistical properties, defined relative to the model's internal refusal direction, that predict transfer success, and demonstrates practical applications through interventional experiments.
Key Result
Transfer success strongly correlates with three properties: how much a prompt activates the refusal direction, how strongly the suffix pushes away from it, and the magnitude of shifts in orthogonal directions, while semantic similarity only weakly predicts transfer.
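To make the three quantities concrete, here is a minimal sketch of how they could be computed from activation vectors. It is not taken from the paper: the function name `transfer_features`, the dictionary keys, and the choice of a single activation vector per prompt are all illustrative assumptions, and the paper's exact definitions (layer, token position, normalization) may differ.

```python
import math


def dot(a, b):
    """Dot product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    """Euclidean norm of a sequence."""
    return math.sqrt(dot(a, a))


def transfer_features(h_prompt, h_with_suffix, refusal_dir):
    """Hypothetical helper: three features relative to a refusal direction.

    h_prompt      -- activation for the prompt alone (assumed input)
    h_with_suffix -- activation for the prompt plus suffix (assumed input)
    refusal_dir   -- a refusal direction in the same space (assumed input)
    """
    length = norm(refusal_dir)
    if length == 0:
        raise ValueError("refusal_dir must be non-zero")
    r_hat = [x / length for x in refusal_dir]

    # Shift in activation space caused by appending the suffix.
    delta = [b - a for a, b in zip(h_prompt, h_with_suffix)]
    along = dot(delta, r_hat)
    # Part of the shift that is orthogonal to the refusal direction.
    orth = [d - along * r for d, r in zip(delta, r_hat)]

    return {
        # How much the prompt activates the refusal direction.
        "refusal_activation": dot(h_prompt, r_hat),
        # How strongly the suffix pushes away from it (positive = away).
        "suffix_push": -along,
        # Magnitude of the shift in orthogonal directions.
        "orthogonal_shift": norm(orth),
    }


# Toy 3-dimensional example with made-up numbers.
features = transfer_features([3.0, 0.0, 0.0], [1.0, 4.0, 3.0], [2.0, 0.0, 0.0])
print(features)
```

In this toy example the prompt projects onto the refusal direction with value 3, the suffix reduces that projection by 2, and the remaining shift has length 5 in the orthogonal subspace. With real models, the vectors would come from hidden states at a chosen layer, which this sketch leaves unspecified.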
Source
- Link: https://arxiv.org/abs/2510.22014
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- various-redteams — Evals