Evaluating Control Protocols for Untrusted AI Agents
Jon Kutasov, Chloe Loughridge, Yuqi Sun, Henry Sleight, Buck Shlegeris, Tyler Tracy, … (+1 more) — 2025-11-04 — Redwood Research — arXiv
Summary
Systematically evaluates control protocols for untrusted AI agents in SHADE-Arena, testing blue team strategies (deferral, resampling, deferring on critical actions) against default and adaptive red team attacks to assess protocol robustness.
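As a rough illustration of the three blue team strategies named above, the following is a minimal sketch and not the paper's implementation. Every name in it (`Action`, the `critical` flag, the `untrusted`, `trusted`, and `monitor` callables, the `threshold` and `max_samples` parameters) is a hypothetical stand-in, and how the paper scores suspicion or decides which actions count as critical is not stated in this summary.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Action:
    description: str
    critical: bool = False  # hypothetical flag marking a high-stakes action


# Hypothetical interfaces: a policy maps a task to an action,
# a monitor maps an action to a suspicion score in [0, 1].
Policy = Callable[[str], Action]
Monitor = Callable[[Action], float]


def defer_to_trusted(task: str, untrusted: Policy, trusted: Policy,
                     monitor: Monitor, threshold: float = 0.5) -> Action:
    """Deferral: use the trusted policy when the monitor is suspicious."""
    action = untrusted(task)
    if monitor(action) > threshold:
        return trusted(task)
    return action


def resample(task: str, untrusted: Policy, trusted: Policy,
             monitor: Monitor, threshold: float = 0.5,
             max_samples: int = 3) -> Action:
    """Resampling: redraw from the untrusted policy while the monitor is
    suspicious, falling back to the trusted policy if every draw is flagged."""
    for _ in range(max_samples):
        action = untrusted(task)
        if monitor(action) <= threshold:
            return action
    return trusted(task)


def defer_on_critical(task: str, untrusted: Policy, trusted: Policy) -> Action:
    """Deferring on critical actions: hand any action flagged as critical
    to the trusted policy, whatever the monitor would have said."""
    action = untrusted(task)
    if action.critical:
        return trusted(task)
    return action
```

In this sketch the last strategy does not consult the monitor at all, which is one possible reading of the protocol; the actual criteria used in the paper may differ.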
Key Result
Deferring on critical actions achieves 96% safety and remains highly robust to adaptive attacks, whereas resampling strategies are vulnerable to red teams with knowledge of the protocol: under such attacks their safety drops from 96% to 17%.
Source
- Link: https://arxiv.org/abs/2511.02997
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- control — Black-box safety (understand and control current model behaviour) / Iterative alignment