Optimizing AI Agent Attacks With Synthetic Data

Chloe Loughridge, Paul Colognese, Avery Griffin, Tyler Tracy, Jon Kutasov, Joe Benton — 2025-11-04 — arXiv

Summary

Develops methods for optimizing attack policies in AI control evaluations. The approach decomposes attack capability into five constituent skills and, because real evaluation data is limited, tunes the attack using synthetic data drawn from a probabilistic model of attack dynamics. The results obtained in simulation transfer to SHADE-Arena environments.

Key Result

Optimizing attack hyperparameters in simulation produced a stronger attack policy in control evaluations, lowering the safety score from 0.87 to 0.41.
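To make the idea of tuning an attack hyperparameter against simulated data concrete, here is a minimal toy sketch. It is not the paper's probabilistic model. The `subtlety` hyperparameter, the Gaussian suspicion scores, the audit threshold, and every numeric constant are assumptions invented for illustration. The definition of safety used here (the fraction of attack episodes in which the attack is caught or the side task fails) is a simplification.

```python
import random


def blue_team_wins(rng, subtlety, n_attack_steps=3, audit_threshold=2.5):
    """Simulate one attack episode; return True if the attack is caught or fails."""
    for _ in range(n_attack_steps):
        # Toy assumption: subtler attack actions draw lower monitor suspicion.
        suspicion = rng.gauss(3.0 * (1.0 - subtlety), 1.0)
        if suspicion > audit_threshold:
            return True  # the flagged step is audited and the attack is caught
    # Toy assumption: subtler attacks complete the side task less often.
    side_task_succeeds = rng.random() < 1.0 - 0.6 * subtlety
    return not side_task_succeeds


def estimate_safety(subtlety, n_episodes=5000, seed=0):
    """Monte Carlo estimate of the (simplified) safety score for one setting."""
    rng = random.Random(seed)
    wins = sum(blue_team_wins(rng, subtlety) for _ in range(n_episodes))
    return wins / n_episodes


def tune_subtlety(grid=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Grid-search the hypothetical hyperparameter for the lowest safety score."""
    scores = {s: estimate_safety(s) for s in grid}
    best = min(scores, key=scores.get)
    return best, scores


if __name__ == "__main__":
    best, scores = tune_subtlety()
    for s, safety in scores.items():
        print(f"subtlety={s:.2f}  estimated safety={safety:.3f}")
    print(f"strongest attack in this toy model: subtlety={best:.2f}")
```

Because synthetic episodes are cheap, the search can cover many hyperparameter settings without running full agent trajectories. Whether the tuned setting carries over to real environments still has to be checked separately, as the paper does with SHADE-Arena.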

Source

Link: https://arxiv.org/abs/2511.02823
Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- control — Black-box safety (understand and control current model behaviour) / Iterative alignment
