Evaluating Control Protocols for Untrusted AI Agents

Jon Kutasov, Chloe Loughridge, Yuqi Sun, Henry Sleight, Buck Shlegeris, Tyler Tracy, … (+1 more) — 2025-11-04 — Redwood Research — arXiv

Summary

Systematically evaluates control protocols for untrusted AI agents in SHADE-Arena, testing blue team strategies (deferral, resampling, deferring on critical actions) against default and adaptive red team attacks to assess protocol robustness.
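
As a rough illustration of the three blue team strategies named in the summary, the sketch below shows one way each decision rule might look. It is not taken from the paper. The `Step` type, the `suspicion` score and its scale, the `critical` flag, and the `threshold` value are all hypothetical names introduced here, and the actual protocols evaluated in SHADE-Arena may differ in important ways.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    """One proposed action from the untrusted agent (hypothetical structure)."""
    action: str
    suspicion: float  # assumed monitor score in [0, 1]; higher = more suspicious
    critical: bool    # assumed flag marking actions that could cause real harm


def defer_to_trusted(step: Step, trusted_action: str, threshold: float = 0.5) -> str:
    """Deferral: replace a suspicious action with the trusted model's action."""
    return trusted_action if step.suspicion >= threshold else step.action


def resample(samples: List[Step], trusted_action: str, threshold: float = 0.5) -> str:
    """Resampling: draw several untrusted proposals and keep the least suspicious.

    If even the best sample looks suspicious, fall back to the trusted action.
    """
    best = min(samples, key=lambda s: s.suspicion)
    return trusted_action if best.suspicion >= threshold else best.action


def defer_on_critical(step: Step, trusted_action: str) -> str:
    """Deferring on critical actions: always hand critical steps to the trusted model."""
    return trusted_action if step.critical else step.action
```

In this simplified picture, the last rule does not consult the monitor score at all. That is one intuition, offered here as an assumption rather than the paper's stated explanation, for why such a rule could be harder for an adaptive attacker to exploit than a rule driven by suspicion scores.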

Key Result

Deferring on critical actions achieves 96% safety and remains highly robust to adaptive attacks, whereas resampling strategies are vulnerable to attackers who know the protocol: against such attacks, their safety drops from 96% to 17%.

Source