Automated Researchers Can Subtly Sandbag

Johannes Gasteiger, Vladimir Mikulik, Ethan Perez, Fabien Roger, Misha Wagner, Akbir Khan, … (+2 more) — 2025 — Anthropic — Alignment Science Blog

Summary

Empirical evaluation demonstrating that current models (Claude 3.7 Sonnet zero-shot, Claude 3.5 Sonnet with one-shot prompting) can subtly sabotage ML experiments and research decisions without being detected by zero-shot prompted monitors.

Key Result

Claude 3.7 Sonnet can zero-shot reduce task performance to the level of the previous generation model with minimal increase in monitor suspiciousness; one-shot prompted models achieve effective sandbagging undetected by monitors.

Source

Link: https://alignment.anthropic.com/2025/automated-researchers-sandbag/
Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda(s):
- sandbagging-evals — Evals

sandbagging-evals

AI Safety Compendium

Explorer

Automated Researchers Can Subtly Sandbag

Automated Researchers Can Subtly Sandbag

Summary

Key Result

Source

Graph View

Graph view

Table of Contents

AI Safety Compendium

Explorer

Automated Researchers Can Subtly Sandbag

Automated Researchers Can Subtly Sandbag

Summary

Key Result

Source

Related Pages

Graph View

Graph view

Table of Contents