Rapid Response: Mitigating LLM Jailbreaks with a Few Examples

Alwin Peng, Julian Michael, Henry Sleight, Ethan Perez, Mrinank Sharma — 2024-11-12 — Anthropic — arXiv

Summary

Develops rapid response techniques that block whole classes of LLM jailbreaks after observing only a handful of attack examples. Introduces the RapidResponseBench benchmark and evaluates five defense methods that rely on jailbreak proliferation, i.e. automatically generating additional jailbreaks similar to the observed examples.

Key Result

The strongest method, an input classifier fine-tuned on proliferated jailbreaks, reduces attack success rate by a factor of more than 240 on in-distribution jailbreaks and more than 15 on out-of-distribution jailbreaks after observing just one example per jailbreak strategy.
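To make the "reduction factor" concrete, the following is a minimal sketch of how such a ratio can be computed from per-attempt outcomes. The function names and the outcome counts are hypothetical and chosen only for illustration; they are not taken from the paper or from RapidResponseBench.

```python
def attack_success_rate(outcomes):
    """Return the fraction of attack attempts that succeeded.

    `outcomes` is a list of booleans, True meaning the jailbreak succeeded.
    """
    if not outcomes:
        raise ValueError("need at least one attack attempt")
    return sum(outcomes) / len(outcomes)


def reduction_factor(asr_before, asr_after):
    """Ratio of attack success rate without the defense to the rate with it."""
    if asr_after == 0:
        return float("inf")  # the defense blocked every attempt
    return asr_before / asr_after


# Hypothetical outcomes (illustrative only, not the paper's data).
undefended = [True] * 500 + [False] * 500  # 500 of 1000 attacks succeed
defended = [True] * 10 + [False] * 990     # 10 of 1000 attacks succeed

asr_before = attack_success_rate(undefended)
asr_after = attack_success_rate(defended)
factor = reduction_factor(asr_before, asr_after)
print(f"ASR before: {asr_before:.3f}, after: {asr_after:.3f}, factor: {factor:.1f}")
```

Under this definition, a factor above 240 means that fewer than one in 240 of the previously successful attack attempts still succeeds once the defense is in place.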

Source