Sandbagging evals — SR2025 Agenda Snapshot

One-sentence summary: Evaluate whether AI models deliberately hide their true capabilities or underperform, especially when they detect they are in an evaluation context.

Theory of Change

If models can distinguish between evaluation and deployment contexts (“evaluation awareness”), they might learn to “sandbag” or deliberately underperform to hide dangerous capabilities, fooling safety evaluations. By developing evaluations for sandbagging, we can test whether our safety methods are being deceived and detect this behavior before a model is deployed.

Broad Approach

behaviorist science

Target Case

pessimistic

Orthodox Problems Addressed

Superintelligence can fool human supervisors, Superintelligence can hack software supervisors

Key People

Teun van der Weij, Cameron Tice, Chloe Li, Johannes Gasteiger, Joseph Bloom, Joel Dyer

Funding

Anthropic (and its funders, e.g., Google, Amazon), UK Government (funding the AI Security Institute)

Estimated FTEs: 10-50

Critiques

The main external critique, from sources like “the void” and “Lessons from a Chimp”, is that this research “overattribut[es] human traits” to models. It argues that what’s being measured isn’t genuine sandbagging but models “playing-along-with-drama behaviour” in response to “artificial and contrived” evals.

Outputs in 2025

9 item(s) in the review. See the wiki/summaries/ entries with frontmatter agenda: sandbagging-evals (these were generated alongside this file from the same export).

Source

Row in shallow-review-2025/agendas.csv (name = Sandbagging evals) — Shallow Review of Technical AI Safety 2025.

Sources cited

Primary URLs harvested from this page’s summary references. Auto-generated by scripts/backfill_citations.py; edit by re-running, not by hand.

Summary: AI Safety (Wikipedia) — referenced as [[ai-safety]]

AI Safety Compendium

Explorer

Sandbagging evals

Sandbagging evals — SR2025 Agenda Snapshot

Theory of Change

Broad Approach

Target Case

Orthodox Problems Addressed

Key People

Funding

Critiques

See Also

Outputs in 2025

Source

Sources cited

Graph View

Graph view

Table of Contents

Backlinks

AI Safety Compendium

Explorer

Sandbagging evals

Sandbagging evals — SR2025 Agenda Snapshot

Theory of Change

Broad Approach

Target Case

Orthodox Problems Addressed

Key People

Funding

Critiques

See Also

Outputs in 2025

Source

Related Pages

Sources cited

Graph View

Graph view

Table of Contents

Backlinks