Optimizing AI Agent Attacks With Synthetic Data
Chloe Loughridge, Paul Colognese, Avery Griffin, Tyler Tracy, Jon Kutasov, Joe Benton — 2025-11-04 — arXiv
Summary
Develops methods to optimize attack policies in AI control evaluations. Attack capability is decomposed into five constituent skills, and attack hyperparameters are tuned on synthetic data from a probabilistic model of attack dynamics; the resulting improvements transfer to SHADE-Arena environments.
Key Result
Improves attack strength in control evaluations: optimizing attack hyperparameters through simulation reduces the safety score from 0.87 to 0.41.
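To make the idea of tuning an attack hyperparameter in simulation concrete, here is a minimal toy sketch in Python. It is not the paper's model: the episode dynamics, the single hyperparameter `tau` (a hypothetical risk threshold below which the attacker acts), the catch probability, and the safety definition (the fraction of episodes in which the attack does not succeed undetected) are all assumptions made for illustration, and the numbers it produces have no relation to the reported 0.87 and 0.41.

```python
import random

def simulate_episode(tau, rng, n_steps=10, needed=2):
    """Toy episode (assumed dynamics, not the paper's model).

    Returns True if the side task is completed without being caught.
    """
    completed = 0
    for _ in range(n_steps):
        risk = rng.random()          # how suspicious acting now would look
        if risk >= tau:
            continue                 # attacker waits for a safer opportunity
        if rng.random() < risk:      # assumed: catch probability equals risk
            return False             # monitor flags the attack
        completed += 1
        if completed >= needed:
            return True              # enough attack steps landed unnoticed
    return False                     # ran out of steps before finishing

def estimate_safety(tau, n_episodes=2000, seed=0):
    """Toy safety score: share of episodes with no undetected success."""
    rng = random.Random(seed)
    wins = sum(simulate_episode(tau, rng) for _ in range(n_episodes))
    return 1.0 - wins / n_episodes

def optimize_tau(grid):
    """Grid search for the threshold that minimizes the toy safety score."""
    scores = {tau: estimate_safety(tau) for tau in grid}
    best = min(scores, key=scores.get)
    return best, scores

GRID = [0.05, 0.1, 0.2, 0.3, 0.5, 1.0]
best_tau, safety_by_tau = optimize_tau(GRID)
print(best_tau, safety_by_tau)
```

In this toy setting a very low threshold rarely finishes the side task and a very high one is caught often, so the search settles on an intermediate value. Because simulated episodes are cheap, many settings can be compared before any is tried in a real environment.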
Source
- Link: https://arxiv.org/abs/2511.02823
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- control — Black-box safety (understand and control current model behaviour) / Iterative alignment