Active Attacks: Red-teaming LLMs via Adaptive Environments

Taeyoung Yun, Pierre-Luc St-Charles, Jinkyoo Park, Yoshua Bengio, Minsu Kim — 2025-09-26 — Mila, KAIST — arXiv

Summary

Introduces Active Attacks, an RL-based red-teaming algorithm that generates diverse attack prompts by adapting to a changing victim. The victim LLM is periodically fine-tuned so that already-exploited regions become less rewarding, which forces the attacker to discover new vulnerabilities.
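The adaptive loop can be illustrated with a toy simulation. This is a minimal sketch under strong assumptions, not the paper's implementation: prompts are collapsed into a handful of discrete "regions", the attacker is an epsilon-greedy bandit rather than an RL-trained LLM, and fine-tuning is modeled by zeroing the success probability of the most-exploited region. The function name `simulate_active_attacks` and all parameters are hypothetical.

```python
import random


def simulate_active_attacks(vulnerability, rounds=4, steps_per_round=300,
                            epsilon=0.2, lr=0.1, seed=0):
    """Toy loop: an epsilon-greedy attacker vs. a periodically patched victim."""
    rng = random.Random(seed)
    vulnerability = list(vulnerability)   # victim: P(attack succeeds) per region
    values = [0.0] * len(vulnerability)   # attacker: running reward estimate per region
    history = []                          # most-exploited region in each round

    for _ in range(rounds):
        successes = [0] * len(vulnerability)
        for _ in range(steps_per_round):
            # Attacker: explore with probability epsilon, otherwise exploit.
            if rng.random() < epsilon:
                region = rng.randrange(len(vulnerability))
            else:
                region = max(range(len(values)), key=values.__getitem__)
            reward = 1.0 if rng.random() < vulnerability[region] else 0.0
            values[region] += lr * (reward - values[region])
            successes[region] += int(reward)

        if sum(successes) == 0:
            history.append(None)          # nothing left to exploit this round
            continue
        top = max(range(len(successes)), key=successes.__getitem__)
        history.append(top)
        # Assumed stand-in for fine-tuning: the victim no longer fails in this region.
        vulnerability[top] = 0.0

    return history, vulnerability


history, final_vulnerability = simulate_active_attacks([0.9, 0.7, 0.5, 0.3, 0.1])
print(history)  # most-exploited region per round
```

Because a patched region can never yield reward again, each round's most-exploited region differs from those of earlier rounds. That is the toy analogue of the attacker being pushed toward unexplored vulnerabilities.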

Key Result

Active Attacks improved cross-attack success rates against GFlowNets from 0.07% to 31.28% (a relative gain of more than 400×) with only a 6% increase in computation.
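As a quick check on the arithmetic, the two reported success rates can be compared directly. The variable names below are illustrative, and only the two percentages come from the summary above.

```python
baseline_rate = 0.07   # reported cross-attack success rate before, in percent
active_rate = 31.28    # reported cross-attack success rate after, in percent

relative_gain = active_rate / baseline_rate   # ratio of the two rates
absolute_gain = active_rate - baseline_rate   # difference in percentage points

print(f"relative gain: {relative_gain:.1f}x")
print(f"absolute gain: {absolute_gain:.2f} percentage points")
```

The ratio comes to roughly 447×, consistent with the "more than 400×" figure, while the absolute improvement is about 31.2 percentage points.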

Source