Rapid Response: Mitigating LLM Jailbreaks with a Few Examples

Alwin Peng, Julian Michael, Henry Sleight, Ethan Perez, Mrinank Sharma — 2024-11-12 — Anthropic — arXiv

Summary

Develops rapid response techniques to block classes of LLM jailbreaks after observing only a handful of attack examples, introducing RapidResponseBench benchmark and evaluating five defense methods using jailbreak proliferation.

Key Result

Strongest method (fine-tuned input classifier on proliferated jailbreaks) reduces attack success rate by factor >240 in-distribution and >15 out-of-distribution after observing just one example per jailbreak strategy.

Source

Link: https://arxiv.org/abs/2411.07494
Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda(s):
- safeguards-inference-time-auxiliaries — Black-box safety (understand and control current model behaviour) / Iterative alignment

safeguards-inference-time-auxiliaries

AI Safety Compendium

Explorer

Rapid Response: Mitigating LLM Jailbreaks with a Few Examples

Rapid Response: Mitigating LLM Jailbreaks with a Few Examples

Summary

Key Result

Source

Graph View

Graph view

Table of Contents

AI Safety Compendium

Explorer

Rapid Response: Mitigating LLM Jailbreaks with a Few Examples

Rapid Response: Mitigating LLM Jailbreaks with a Few Examples

Summary

Key Result

Source

Related Pages

Graph View

Graph view

Table of Contents