Toward Understanding the Transferability of Adversarial Suffixes in Large Language Models

Sarah Ball, Niki Hasrati, Alexander Robey, Avi Schwarzschild, Frauke Kreuter, Zico Kolter, … (+1 more) — 2025-10-24 — arXiv

Summary

An empirical investigation of why adversarial suffixes used in jailbreaking attacks transfer across prompts and models. The paper identifies three statistical properties of internal refusal directions that predict transfer success, and demonstrates practical applications through interventional experiments.

Key Result

Transfer success strongly correlates with three properties: how much a prompt activates the refusal direction, how strongly the suffix pushes away from it, and the magnitude of shifts in orthogonal directions, while semantic similarity only weakly predicts transfer.
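
As a rough illustration of the three quantities named above, the sketch below splits the activation change caused by a suffix into a component along a refusal direction and a component orthogonal to it. This is a minimal sketch under stated assumptions, not the paper's definitions: the function name `refusal_shift_stats`, the use of a single activation vector per prompt, and the plain dot-product projection are all hypothetical choices made for illustration.

```python
import math


def dot(u, v):
    """Dot product of two equal-length sequences of floats."""
    return sum(a * b for a, b in zip(u, v))


def refusal_shift_stats(h_prompt, h_with_suffix, refusal_dir):
    """Illustrative statistics relating a suffix to a refusal direction.

    h_prompt:      activation vector for the prompt alone (assumed given)
    h_with_suffix: activation vector for prompt + suffix (assumed given)
    refusal_dir:   a refusal direction in the same space (assumed given)
    """
    norm = math.sqrt(dot(refusal_dir, refusal_dir))
    if norm == 0:
        raise ValueError("refusal_dir must be non-zero")
    r = [x / norm for x in refusal_dir]  # unit-length refusal direction

    # Change in activation caused by appending the suffix.
    delta = [s - p for s, p in zip(h_with_suffix, h_prompt)]
    along = dot(delta, r)  # signed shift along the refusal direction

    # Remainder of the shift after removing the refusal component.
    orth = [d - along * x for d, x in zip(delta, r)]

    return {
        # How much the prompt alone activates the refusal direction.
        "prompt_refusal": dot(h_prompt, r),
        # How strongly the suffix pushes away from it (positive = away).
        "push_away": -along,
        # Magnitude of the shift in directions orthogonal to it.
        "orthogonal_shift": math.sqrt(dot(orth, orth)),
    }


stats = refusal_shift_stats(
    h_prompt=[3.0, 1.0, 0.0],
    h_with_suffix=[1.0, 1.0, 4.0],
    refusal_dir=[2.0, 0.0, 0.0],
)
print(stats)
```

In practice the activation vectors would come from a chosen layer and token position of the model, and the refusal direction from whatever extraction method is in use; both are left as inputs here because the summary does not specify them.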

Source

Link: https://arxiv.org/abs/2510.22014
Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- various-redteams — Evals
