Evaluating Control Protocols for Untrusted AI Agents

Jon Kutasov, Chloe Loughridge, Yuqi Sun, Henry Sleight, Buck Shlegeris, Tyler Tracy, … (+1 more) — 2025-11-04 — Redwood Research — arXiv

Summary

Systematically evaluates control protocols for untrusted AI agents in SHADE-Arena, testing blue team strategies (deferral, resampling, deferring on critical actions) against default and adaptive red team attacks to assess protocol robustness.

Key Result

Deferring on critical actions achieves 96% safety and remains highly robust to adaptive attacks. Resampling strategies, by contrast, are vulnerable to attackers with knowledge of the protocol: their safety drops from 96% to 17%.
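
To make the idea of deferring on critical actions concrete, here is a minimal sketch of one way such a protocol could be wired up. It is an illustration under stated assumptions, not the paper's implementation: the `Action` structure, the `is_critical` label, and the `trusted_stub` policy are all hypothetical names introduced here, and this entry does not say how the paper decides which actions count as critical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Action:
    """A single proposed agent step (hypothetical structure for illustration)."""
    description: str
    is_critical: bool          # assumed label: could this step cause serious harm?
    source: str = "untrusted"  # which model produced the step


def defer_on_critical(proposed: Action,
                      trusted_policy: Callable[[Action], Action]) -> Action:
    """Run non-critical steps as proposed; hand critical steps to the trusted model."""
    if proposed.is_critical:
        return trusted_policy(proposed)
    return proposed


def trusted_stub(proposed: Action) -> Action:
    """Stand-in for a weaker trusted model that redoes the step itself."""
    return Action(
        description=f"trusted redo of: {proposed.description}",
        is_critical=proposed.is_critical,
        source="trusted",
    )


# Two steps proposed by the untrusted agent; only the second is labeled critical.
steps = [
    Action("read the user's calendar", is_critical=False),
    Action("send a bank transfer", is_critical=True),
]
executed = [defer_on_critical(step, trusted_stub) for step in steps]
for action in executed:
    print(action.source, "-", action.description)
```

In this sketch the deferral decision depends only on the kind of action, not on a per-step suspicion score that an attacker could try to predict or game. That is one plausible intuition for why such a rule might hold up better against adaptive attacks, though it is an interpretation rather than a claim made in this entry.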

Source

Link: https://arxiv.org/abs/2511.02997
Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- control — Black-box safety (understand and control current model behaviour) / Iterative alignment
