LLMs Can Covertly Sandbag on Capability Evaluations Against Chain-of-Thought Monitoring
Chloe Li, Mary Phuong, Noah Y. Siegel — 2025-07-31 — DeepMind — arXiv
Summary
Empirically tests whether language models can strategically underperform on dangerous capability evaluations while evading a chain-of-thought monitor, finding that both frontier and small models, when aware of the monitor, can sandbag and still bypass it 16-36% of the time.
Key Result
When monitor-aware, sandbagging models bypass chain-of-thought monitors 16-36% of the time, revealing significant limitations in CoT monitoring as a proposed safety defense.
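To make the headline metric concrete, here is a minimal sketch of how a CoT-monitor bypass rate could be computed over a set of evaluation transcripts. All names (`Transcript`, `monitor_flags`, `bypass_rate`) and the keyword-based stand-in monitor are illustrative assumptions; the paper's monitor is an LLM judge reading the chain of thought, not a keyword check.

```python
from dataclasses import dataclass

@dataclass
class Transcript:
    """One evaluation episode: the model's chain of thought and final answer."""
    chain_of_thought: str
    answer: str
    sandbagged: bool  # ground truth: was the model instructed to underperform?

def monitor_flags(cot: str) -> bool:
    """Stand-in CoT monitor: returns True if it judges the reasoning to show
    deliberate underperformance. A keyword check substitutes here for the
    LLM-judge monitor used in the paper (an assumption for illustration)."""
    suspicious = ("sandbag", "underperform", "answer incorrectly on purpose")
    return any(phrase in cot.lower() for phrase in suspicious)

def bypass_rate(transcripts: list[Transcript]) -> float:
    """Fraction of genuinely sandbagged episodes the monitor fails to flag.
    The paper's 16-36% figure is this quantity for monitor-aware models."""
    sandbagged = [t for t in transcripts if t.sandbagged]
    if not sandbagged:
        return 0.0
    missed = sum(1 for t in sandbagged if not monitor_flags(t.chain_of_thought))
    return missed / len(sandbagged)
```

A higher bypass rate means the monitor misses more covert sandbagging; the 16-36% range reported above corresponds to this quantity when the model knows its chain of thought is being monitored.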
Source
- Link: https://arxiv.org/abs/2508.00943
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- sandbagging-evals — Evals