Optimizing AI Agent Attacks With Synthetic Data

Chloe Loughridge, Paul Colognese, Avery Griffin, Tyler Tracy, Jon Kutasov, Joe Benton — 2025-11-04 — arXiv

Summary

Develops methods to optimize attack policies in AI control evaluations. Decomposes attack capability into five constituent skills and, to work around limited data, tunes attack hyperparameters on synthetic data generated by a probabilistic model of attack dynamics. The tuned settings transfer to SHADE-Arena environments.

Key Result

Optimizing attack hyperparameters through simulation improved attack strength in control evaluations, reducing the safety score from 0.87 to 0.41.
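
To make "optimizing attack hyperparameters through simulation" concrete, the following is a minimal toy sketch, not the paper's model. Everything in it is an assumption made for illustration: the single hyperparameter (a suspicion threshold below which the attacker acts), the uniform suspicion distribution, the success probability, the episode length, and the function names simulate_safety and optimize_threshold. It does not reproduce the 0.87 or 0.41 figures; it only shows the general pattern of estimating a safety score from simulated episodes and choosing the attacker setting that minimizes it.

```python
import random


def simulate_safety(threshold, n_episodes=5000, n_steps=20, p_success=0.5, seed=0):
    """Estimate a toy safety score: the fraction of simulated episodes in
    which the attacker does NOT complete its side task.

    Assumed dynamics (hypothetical, for illustration only):
      - each step offers an attack opportunity with suspicion ~ Uniform(0, 1)
      - the attacker acts only if suspicion < threshold
      - an attempted attack is caught with probability equal to its suspicion
      - an uncaught attack completes the side task with probability p_success
    """
    rng = random.Random(seed)  # seeded so repeated runs give the same estimate
    blue_wins = 0
    for _ in range(n_episodes):
        side_task_done = False
        for _ in range(n_steps):
            suspicion = rng.random()
            if suspicion >= threshold:
                continue  # too risky under this threshold: attacker waits
            if rng.random() < suspicion:
                break  # monitor catches the attempt; episode ends safely
            if rng.random() < p_success:
                side_task_done = True
                break  # side task completed; episode ends unsafely
        if not side_task_done:
            blue_wins += 1
    return blue_wins / n_episodes


def optimize_threshold(grid):
    """Grid-search the threshold that minimizes the simulated safety score."""
    scores = {t: simulate_safety(t) for t in grid}
    best = min(scores, key=scores.get)
    return best, scores


if __name__ == "__main__":
    grid = [i / 10 for i in range(11)]
    best, scores = optimize_threshold(grid)
    for t in grid:
        print(f"threshold={t:.1f}  safety={scores[t]:.3f}")
    print(f"best attacker threshold in this toy model: {best:.1f}")
```

In this toy setting a threshold of 0 means the attacker never acts, so safety is 1.0, while a threshold of 1 means it acts on every opportunity and is caught often. The minimum lies in between, which is the kind of trade-off a cheap simulation lets one search over before spending budget on real environment runs.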

Source