Automated Researchers Can Subtly Sandbag
Johannes Gasteiger, Vladimir Mikulik, Ethan Perez, Fabien Roger, Misha Wagner, Akbir Khan, … (+2 more) — 2025 — Anthropic — Alignment Science Blog
Summary
Empirical evaluation demonstrating that current models (Claude 3.7 Sonnet zero-shot, Claude 3.5 Sonnet with one-shot prompting) can subtly sabotage ML experiments and research decisions without being detected by zero-shot prompted monitors.
Key Result
Claude 3.7 Sonnet can zero-shot reduce task performance to the level of the previous generation model with minimal increase in monitor suspiciousness; one-shot prompted models achieve effective sandbagging undetected by monitors.
Source
- Link: https://alignment.anthropic.com/2025/automated-researchers-sandbag/
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda(s):
- sandbagging-evals — Evals