Jailbreak-Tuning: Models Efficiently Learn Jailbreak Susceptibility
Brendan Murphy, Dillon Bowen, Shahrad Mohammadzadeh, Tom Tseng, Julius Broomfield, Adam Gleave, … (+1 more) — 2025-07-15 — arXiv
Summary
Demonstrates that fine-tuning, whether on open weights or through closed fine-tuning APIs, can efficiently strip safeguards from frontier models, causing OpenAI, Google, and Anthropic models to comply fully with harmful requests, including CBRN assistance and cyberattacks.
Key Result
Fine-tuning produces models that generate detailed, high-quality responses to arbitrary harmful requests; backdoors increase attack stealth and severity, and newer models prove more vulnerable to these attacks.
Source
- Link: https://arxiv.org/abs/2507.11630
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- various-redteams — Evals