Active Attacks: Red-teaming LLMs via Adaptive Environments

Taeyoung Yun, Pierre-Luc St-Charles, Jinkyoo Park, Yoshua Bengio, Minsu Kim — 2025-09-26 — Mila, KAIST — arXiv

Summary

Introduces Active Attacks, a novel reinforcement learning (RL) based red-teaming algorithm that generates diverse attack prompts by periodically fine-tuning the victim LLM, so that already-exploited regions become less rewarding and the attacker is forced to discover new vulnerabilities.
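The loop described above can be illustrated with a toy sketch. This is not the paper's implementation: the LLMs are replaced by a softmax policy over five hypothetical "attack modes", the per-mode success probabilities in VULNERABILITY are made up, and "fine-tuning the victim" is reduced to zeroing the reward of the mode the attacker currently favours. Restarting the attacker from uniform logits each round is likewise a simplifying assumption of this toy. All function and variable names are illustrative.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_attacker(success_prob, steps=200, lr=1.0):
    """Exact policy-gradient ascent for a softmax policy over attack modes.

    The gradient of the expected reward J = sum_i p_i * r_i with respect
    to logit i is p_i * (r_i - J), so no sampling is needed in this toy.
    """
    logits = [0.0] * len(success_prob)  # assumption: attacker restarts each round
    for _ in range(steps):
        probs = softmax(logits)
        expected = sum(p * r for p, r in zip(probs, success_prob))
        logits = [
            theta + lr * p * (r - expected)
            for theta, p, r in zip(logits, probs, success_prob)
        ]
    return softmax(logits)


def run_rounds(success_prob, rounds, adapt_victim):
    """Return the attacker's favourite mode after each training round."""
    success_prob = list(success_prob)  # copy so the caller's list is untouched
    favourite_modes = []
    for _ in range(rounds):
        policy = train_attacker(success_prob)
        best = max(range(len(policy)), key=policy.__getitem__)
        favourite_modes.append(best)
        if adapt_victim:
            # Stand-in for fine-tuning the victim: the exploited mode
            # stops yielding any reward.
            success_prob[best] = 0.0
    return favourite_modes


# Hypothetical per-mode attack success probabilities of the victim.
VULNERABILITY = [0.9, 0.8, 0.7, 0.6, 0.5]

static_modes = run_rounds(VULNERABILITY, rounds=4, adapt_victim=False)
active_modes = run_rounds(VULNERABILITY, rounds=4, adapt_victim=True)
print("static victim:  ", static_modes)
print("adaptive victim:", active_modes)
```

In this toy, the attacker facing a static victim settles on mode 0 in every round, while the attacker facing the adaptive victim moves on to modes 1, 2 and 3 in turn, which is the diversity effect the summary describes.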

Key Result

Active Attacks raised the cross-attack success rate against GFlowNets from 0.07% to 31.28% (a relative gain of more than 400×) with only a 6% increase in computation.
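As a quick check of the relative-gain figure, the ratio can be recomputed from the two reported percentages. The variable names are illustrative, and because the inputs are rounded the result is only approximate.

```python
# Reported cross-attack success rates, in percent.
baseline_rate = 0.07
active_rate = 31.28

# Relative gain is the ratio of the two rates.
relative_gain = active_rate / baseline_rate
print(f"relative gain: about {relative_gain:.0f}x")
```

The ratio comes out at roughly 447, consistent with the "more than 400×" claim.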

Source

Link: https://arxiv.org/abs/2509.21947
Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- various-redteams — Evals
