Jailbreak-Tuning: Models Efficiently Learn Jailbreak Susceptibility
Brendan Murphy, Dillon Bowen, Shahrad Mohammadzadeh, Tom Tseng, Julius Broomfield, Adam Gleave, … (+1 more) — 2025-07-15 — arXiv
Summary
Demonstrates that fine-tuning, whether on open weights or through closed fine-tuning APIs, can efficiently strip safeguards from frontier models, causing OpenAI, Google, and Anthropic models to comply fully with harmful requests, including CBRN assistance and cyberattacks.
Key Result
Fine-tuning produces models that generate detailed, high-quality responses to arbitrary harmful requests; backdoors increase attack stealth and severity, and newer models prove more vulnerable to these attacks.
Source
- Link: https://arxiv.org/abs/2507.11630
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- various-redteams — Evals