Jailbreak Defense in a Narrow Domain: Limitations of Existing Methods and a New Transcript-Classifier Approach

Tony T. Wang, John Hughes, Henry Sleight, Rylan Schaeffer, Rajashree Agrawal, Fazl Barez, … (+4 more) — 2024-12-03 — arXiv

Summary

Empirically evaluates existing jailbreak defenses (safety training, adversarial training, and input/output classifiers) on the narrow task of preventing LLMs from providing bomb-making assistance, and develops a new transcript-classifier defense that outperforms these baselines but still fails in some cases.

Key Result

Popular defenses including safety training, adversarial training, and input/output classifiers are unable to fully prevent jailbreaks even in a narrowly-defined domain, and the proposed transcript-classifier defense, while better, still fails in some circumstances.

Source

Link: https://arxiv.org/abs/2412.02159
Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- various-redteams — Evals
