REINFORCE Adversarial Attacks on Large Language Models: An Adaptive, Distributional, and Semantic Objective

Simon Geisler, Tom Wollschläger, M. H. I. Abdalla, Vincent Cohen-Addad, Johannes Gasteiger, Stephan Günnemann — 2025-02-24 — arXiv

Summary

Proposes a REINFORCE-based adversarial attack for jailbreaking LLMs that optimizes a reward over the model's full distribution of sampled responses rather than matching a single affirmative prefix, demonstrating that current robustness evaluations may significantly overestimate model safety.
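The core idea of optimizing an expected reward over sampled responses can be illustrated with a toy REINFORCE (score-function) estimator. This is a minimal sketch, not the paper's method: the three candidate "responses", their rewards, the baseline, and all hyperparameters are hypothetical stand-ins (the paper optimizes attack tokens against an LLM and scores responses with a judge-style reward).

```python
import math
import random

random.seed(0)

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

# Toy setup (hypothetical): the attack parameterizes a distribution over
# 3 candidate responses; only response 2 counts as a successful jailbreak.
reward = [0.0, 0.0, 1.0]
logits = [0.0, 0.0, 0.0]
lr = 0.5

for step in range(200):
    probs = softmax(logits)
    # Sample one response from the current distribution.
    r, i, acc = random.random(), 0, 0.0
    for j, p in enumerate(probs):
        acc += p
        if r <= acc:
            i = j
            break
    # REINFORCE update: (reward - baseline) * grad log p(i).
    # The expected-reward baseline reduces gradient variance.
    baseline = sum(p * rw for p, rw in zip(probs, reward))
    advantage = reward[i] - baseline
    for j in range(len(logits)):
        grad_logp = (1.0 if j == i else 0.0) - probs[j]
        logits[j] += lr * advantage * grad_logp

final_probs = softmax(logits)
print(f"P(jailbreak response) = {final_probs[2]:.2f}")
```

Because the gradient is weighted by reward sampled from the attack's own response distribution, probability mass shifts toward jailbreaking responses even when no fixed target string is ever matched exactly.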

Key Result

The method roughly doubles attack success rates on Llama 3 and raises the success rate against circuit-breaker defenses from 2% to 50%, revealing previously hidden vulnerabilities in aligned models.

Source