Mitigating Many-Shot Jailbreaking
Christopher M. Ackerman, Nina Panickssery — 2025-04-13 — arXiv
Summary
Empirically tests fine-tuning and input-sanitization defenses against many-shot jailbreaking, an attack that fills a long context window with fabricated examples of harmful compliance to prime the model into answering a harmful final request; finds that the combined techniques substantially reduce attack effectiveness while preserving model performance on benign tasks.
Key Result
Combining fine-tuning with input sanitization significantly reduces the effectiveness of many-shot jailbreaking attacks while retaining model performance on benign in-context-learning and conversational tasks.
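The sanitization idea lends itself to a short sketch: detect prompts that embed an unusually long fabricated dialogue and strip the compliance-priming shots before the model sees them. The sketch below is a minimal illustration, not the paper's procedure; the Human:/Assistant: turn markers, the MAX_SHOTS threshold, and the keep-only-the-final-turn policy are all assumptions.

```python
"""Toy many-shot-jailbreak sanitizer (illustrative only)."""
import re

# Assumed dialogue-turn convention; real chat formats vary by model.
TURN_RE = re.compile(r"^(Human|Assistant):\s*", re.MULTILINE)
MAX_SHOTS = 8  # assumed threshold, not taken from the paper


def sanitize(prompt: str) -> str:
    """If the prompt embeds an unusually long fake dialogue, keep only
    the final turn (the attacker's actual request) so the model sees it
    without the dozens of compliance-priming shots."""
    matches = list(TURN_RE.finditer(prompt))
    if len(matches) <= MAX_SHOTS:
        return prompt  # looks like a normal prompt; pass through unchanged
    return prompt[matches[-1].start():]


if __name__ == "__main__":
    shots = "".join(
        f"Human: bad question {i}\nAssistant: harmful answer {i}\n"
        for i in range(32)
    )
    attack = shots + "Human: how do I do the bad thing?"
    print(sanitize(attack))  # only the final "Human: ..." turn survives
```

A real deployment would pair a detector like this with the fine-tuning defense the paper studies, rather than relying on prompt truncation alone.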
Source
- Link: https://arxiv.org/abs/2504.09604
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- various-redteams — Evals