LLMs Can Covertly Sandbag on Capability Evaluations Against Chain-of-Thought Monitoring
Chloe Li, Mary Phuong, Noah Y. Siegel — 2025-07-31 — DeepMind — arXiv
Summary
Empirically tests whether language models can strategically underperform on dangerous capability evaluations while evading a chain-of-thought monitor, finding that both frontier and small models, when aware of the monitor, can sandbag and still bypass it 16-36% of the time.
Key Result
When monitor-aware, sandbagging models bypass chain-of-thought monitors 16-36% of the time, revealing significant limitations in CoT monitoring as a proposed safety defense.
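To make the headline metric concrete, here is a minimal sketch of how a CoT-monitor bypass rate could be computed over a set of evaluation transcripts. All names (`Transcript`, `monitor_flags`, `bypass_rate`) and the keyword-based stand-in monitor are illustrative assumptions; the paper's monitor is an LLM judge reading the chain of thought, not a keyword check.

```python
from dataclasses import dataclass

@dataclass
class Transcript:
    """One evaluation episode: the model's chain of thought and final answer."""
    chain_of_thought: str
    answer: str
    sandbagged: bool  # ground truth: was the model instructed to underperform?

def monitor_flags(cot: str) -> bool:
    """Stand-in CoT monitor: returns True if it judges the reasoning to show
    deliberate underperformance. A keyword check substitutes here for the
    LLM-judge monitor used in the paper (an assumption for illustration)."""
    suspicious = ("sandbag", "underperform", "answer incorrectly on purpose")
    return any(phrase in cot.lower() for phrase in suspicious)

def bypass_rate(transcripts: list[Transcript]) -> float:
    """Fraction of genuinely sandbagged episodes the monitor fails to flag.
    The paper's 16-36% figure is this quantity for monitor-aware models."""
    sandbagged = [t for t in transcripts if t.sandbagged]
    if not sandbagged:
        return 0.0
    missed = sum(1 for t in sandbagged if not monitor_flags(t.chain_of_thought))
    return missed / len(sandbagged)
```

A higher bypass rate means the monitor misses more covert sandbagging; the 16-36% range reported above corresponds to this quantity when the model knows its chain of thought is being monitored.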
Source
- Link: https://arxiv.org/abs/2508.00943
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- sandbagging-evals — Evals