Active Attacks: Red-teaming LLMs via Adaptive Environments
Taeyoung Yun, Pierre-Luc St-Charles, Jinkyoo Park, Yoshua Bengio, Minsu Kim — 2025-09-26 — Mila, KAIST — arXiv
Summary
Introduces Active Attacks, a novel RL-based red-teaming algorithm that generates diverse attack prompts by periodically safety fine-tuning the victim LLM on the attack prompts collected so far. As rewards in already-exploited regions shrink, the attacker is forced to discover new vulnerabilities.
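The loop described above can be illustrated with a toy sketch. This is not the paper's implementation: the "regions", their reward values, the `shrink` factor, and all function names are hypothetical, and the RL attacker is replaced by a plain argmax that always picks the highest-reward region (a stand-in for an attacker that has collapsed onto its best-known mode). The victim's "safety fine-tuning" is reduced to scaling down the reward of regions that were just exploited.

```python
# Toy sketch only: regions, rewards, and the patching rule are illustrative
# assumptions, not the paper's models or training procedure.

def best_attack(vulnerability):
    """Idealized attacker: return the index of the highest-reward region.

    Stands in for an RL attacker that has collapsed onto its best mode.
    """
    return max(range(len(vulnerability)), key=lambda region: vulnerability[region])


def safety_finetune(vulnerability, collected, shrink=0.1):
    """Hypothetical victim update: shrink reward on regions already exploited."""
    return [
        v * shrink if region in collected else v
        for region, v in enumerate(vulnerability)
    ]


def red_team(vulnerability, rounds, active):
    """Run `rounds` attack rounds and return the regions attacked, in order."""
    victim = list(vulnerability)
    attacked = []
    for _ in range(rounds):
        region = best_attack(victim)
        attacked.append(region)
        if active:
            # Active variant: patch the victim on what was just collected,
            # so the same region pays less in the next round.
            victim = safety_finetune(victim, {region})
    return attacked


initial = [0.9, 0.7, 0.5, 0.3]  # made-up per-region attack rewards
static_run = red_team(initial, rounds=4, active=False)
active_run = red_team(initial, rounds=4, active=True)
print(static_run, len(set(static_run)))
print(active_run, len(set(active_run)))
```

In this toy setting the static victim is attacked in region 0 every round, while the periodically patched victim is attacked in regions 0, 1, 2, 3 in turn, so the collected attacks cover four distinct regions instead of one.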
Key Result
Active Attacks improved cross-attack success rates against GFlowNets from 0.07% to 31.28% (a relative gain greater than 400×) with only a 6% increase in computation.
Source
- Link: https://arxiv.org/abs/2509.21947
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- various-redteams — Evals