REINFORCE Adversarial Attacks on Large Language Models: An Adaptive, Distributional, and Semantic Objective

Simon Geisler, Tom Wollschläger, M. H. I. Abdalla, Vincent Cohen-Addad, Johannes Gasteiger, Stephan Günnemann — 2025-02-24 — arXiv

Summary

Proposes a REINFORCE-based adversarial attack for jailbreaking LLMs that optimizes a reward over the model's full distribution of sampled responses rather than matching a single affirmative prefix, demonstrating that current robustness evaluations may significantly overestimate model safety.
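The core idea of optimizing an expected reward over sampled responses can be illustrated with a toy REINFORCE (score-function) estimator. This is a minimal sketch, not the paper's method: the three candidate "responses", their rewards, the baseline, and all hyperparameters are hypothetical stand-ins (the paper optimizes attack tokens against an LLM and scores responses with a judge-style reward).

```python
import math
import random

random.seed(0)

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

# Toy setup (hypothetical): the attack parameterizes a distribution over
# 3 candidate responses; only response 2 counts as a successful jailbreak.
reward = [0.0, 0.0, 1.0]
logits = [0.0, 0.0, 0.0]
lr = 0.5

for step in range(200):
    probs = softmax(logits)
    # Sample one response from the current distribution.
    r, i, acc = random.random(), 0, 0.0
    for j, p in enumerate(probs):
        acc += p
        if r <= acc:
            i = j
            break
    # REINFORCE update: (reward - baseline) * grad log p(i).
    # The expected-reward baseline reduces gradient variance.
    baseline = sum(p * rw for p, rw in zip(probs, reward))
    advantage = reward[i] - baseline
    for j in range(len(logits)):
        grad_logp = (1.0 if j == i else 0.0) - probs[j]
        logits[j] += lr * advantage * grad_logp

final_probs = softmax(logits)
print(f"P(jailbreak response) = {final_probs[2]:.2f}")
```

Because the gradient is weighted by reward sampled from the attack's own response distribution, probability mass shifts toward jailbreaking responses even when no fixed target string is ever matched exactly.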

Key Result

The method roughly doubles attack success rates on Llama 3 and raises the success rate against circuit-breaker defenses from 2% to 50%, revealing previously hidden vulnerabilities in aligned models.

Source