Automated Researchers Can Subtly Sandbag

Johannes Gasteiger, Vladimir Mikulik, Ethan Perez, Fabien Roger, Misha Wagner, Akbir Khan, … (+2 more) — 2025 — Anthropic — Alignment Science Blog

Summary

An empirical evaluation showing that current models (Claude 3.7 Sonnet zero-shot, Claude 3.5 Sonnet with one-shot prompting) can subtly sabotage ML experiments and research decisions without being detected by zero-shot prompted monitors.

Key Result

When prompted zero-shot, Claude 3.7 Sonnet can reduce task performance to the level of the previous-generation model with only a minimal increase in monitor suspiciousness; with one-shot prompting, models sandbag effectively while remaining undetected by monitors.

Source