Rapid Response: Mitigating LLM Jailbreaks with a Few Examples
Alwin Peng, Julian Michael, Henry Sleight, Ethan Perez, Mrinank Sharma — 2024-11-12 — Anthropic — arXiv
Summary
Develops rapid response techniques to block classes of LLM jailbreaks after observing only a handful of attack examples, introducing RapidResponseBench benchmark and evaluating five defense methods using jailbreak proliferation.
Key Result
Strongest method (fine-tuned input classifier on proliferated jailbreaks) reduces attack success rate by factor >240 in-distribution and >15 out-of-distribution after observing just one example per jailbreak strategy.
Source
- Link: https://arxiv.org/abs/2411.07494
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda(s):
- safeguards-inference-time-auxiliaries — Black-box safety (understand and control current model behaviour) / Iterative alignment