STACK: Adversarial Attacks on LLM Safeguard Pipelines

Ian R. McKenzie, Oskar J. Hollinsworth, Tom Tseng, Xander Davies, Stephen Casper, Aaron D. Tucker, … (+2 more) — 2025-06-30 — arXiv

Summary

Develops and evaluates STACK (STaged AttaCK), a novel black-box attack procedure against LLM safeguard pipelines. STACK achieves a 71% attack success rate on the ClearHarm catastrophic misuse dataset, and 33% in a transfer setting.

Key Result

STACK achieves a 71% attack success rate (ASR) on ClearHarm against a few-shot-prompted classifier pipeline in a black-box setting, and a 33% ASR in a transfer setting with no access to the target pipeline.

Source