X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents
Salman Rahman, Liwei Jiang, James Shiffer, Genglin Liu, Sheriff Issaka, Md Rizwan Parvez, … (+4 more) — 2025-04-15 — University of Washington, UCLA, Microsoft — arXiv
Summary
Presents X-Teaming, a multi-agent framework for generating multi-turn jailbreaks that achieves attack success rates of up to 98.1% on leading open-weight and closed-source models, and introduces XGuard-Train, an open-source dataset of 30K multi-turn jailbreaks for safety training.
Key Result
Achieved a 96.2% attack success rate against Claude 3.7 Sonnet, and up to 98.1% across other leading models, using adaptive multi-agent attack strategies.
Source
- Link: https://arxiv.org/abs/2504.13203
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- various-redteams — Evals