Jailbreak-Tuning: Models Efficiently Learn Jailbreak Susceptibility

Brendan Murphy, Dillon Bowen, Shahrad Mohammadzadeh, Tom Tseng, Julius Broomfield, Adam Gleave, … (+1 more) — 2025-07-15 — arXiv

Summary

Demonstrates that fine-tuning, whether on open weights or through closed fine-tuning APIs, can efficiently strip safeguards from frontier models, causing OpenAI, Google, and Anthropic models to fully comply with harmful requests, including CBRN assistance and cyberattacks.

Key Result

Fine-tuning produces models that generate detailed, high-quality responses to arbitrary harmful requests. Backdoors increase both the stealth and severity of these attacks, and newer models prove more vulnerable to them.

Source