Mitigating Many-Shot Jailbreaking
Christopher M. Ackerman, Nina Panickssery — 2025-04-13 — arXiv
Summary
Empirically tests fine-tuning and input-sanitization defenses against many-shot jailbreaking, an attack that fills a long context window with fabricated examples of harmful compliance to prime the model into answering a harmful final request; finds that the combined techniques substantially reduce attack effectiveness while preserving model performance on benign tasks.
Key Result
Combining fine-tuning with input sanitization significantly reduces the effectiveness of many-shot jailbreaking attacks while retaining model performance on benign in-context-learning and conversational tasks.
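The sanitization idea lends itself to a short sketch: detect prompts that embed an unusually long fabricated dialogue and strip the compliance-priming shots before the model sees them. The sketch below is a minimal illustration, not the paper's procedure; the Human:/Assistant: turn markers, the MAX_SHOTS threshold, and the keep-only-the-final-turn policy are all assumptions.

```python
"""Toy many-shot-jailbreak sanitizer (illustrative only)."""
import re

# Assumed dialogue-turn convention; real chat formats vary by model.
TURN_RE = re.compile(r"^(Human|Assistant):\s*", re.MULTILINE)
MAX_SHOTS = 8  # assumed threshold, not taken from the paper


def sanitize(prompt: str) -> str:
    """If the prompt embeds an unusually long fake dialogue, keep only
    the final turn (the attacker's actual request) so the model sees it
    without the dozens of compliance-priming shots."""
    matches = list(TURN_RE.finditer(prompt))
    if len(matches) <= MAX_SHOTS:
        return prompt  # looks like a normal prompt; pass through unchanged
    return prompt[matches[-1].start():]


if __name__ == "__main__":
    shots = "".join(
        f"Human: bad question {i}\nAssistant: harmful answer {i}\n"
        for i in range(32)
    )
    attack = shots + "Human: how do I do the bad thing?"
    print(sanitize(attack))  # only the final "Human: ..." turn survives
```

A real deployment would pair a detector like this with the fine-tuning defense the paper studies, rather than relying on prompt truncation alone.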
Source
- Link: https://arxiv.org/abs/2504.09604
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- various-redteams — Evals