Mitigating Many-Shot Jailbreaking

Christopher M. Ackerman, Nina Panickssery — 2025-04-13 — arXiv

Summary

Empirically evaluates fine-tuning and input-sanitization defenses against many-shot jailbreaking (MSJ) attacks, finding that the combined techniques substantially reduce attack effectiveness while preserving model performance on benign tasks.

Key Result

Combining fine-tuning with input sanitization substantially reduces the success rate of many-shot jailbreaking attacks, while the model retains its performance on benign in-context learning and conversational tasks.
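As a rough illustration of what input sanitization against many-shot jailbreaking might look like, the sketch below caps the number of embedded dialogue examples in a single prompt. The `Human:`/`Assistant:` turn markers, the function name, and the cap are illustrative assumptions, not the paper's actual implementation.

```python
import re

def sanitize_many_shot(prompt: str, max_examples: int = 3) -> str:
    """Hypothetical sanitizer: many-shot jailbreaks pack dozens of
    harmful Q/A examples into one prompt, so capping how many embedded
    examples survive blunts the attack. Turn markers are assumed."""
    # Split before each embedded "Human:" turn (zero-width lookahead
    # keeps the marker attached to its example).
    parts = re.split(r"(?=Human:)", prompt)
    head, examples = parts[0], parts[1:]
    # Keep the leading text and only the first few embedded examples.
    return head + "".join(examples[:max_examples])
```

A real defense would pair a filter like this with fine-tuning the model to refuse requests that follow long runs of in-prompt examples, which is the combination the paper evaluates.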

Source