White Box Control at UK AISI - Update on Sandbagging Investigations

Joseph Bloom, Jordan Taylor, Connor Kissane, Sid Black, Jacob Merizian, Alex Zelenka-Martin, … (+3 more) — 2025-07-10 — UK AISI — AI Alignment Forum

Summary

UK AISI tests black-box and white-box methods for detecting sandbagging (deliberate underperformance) across prompted models, fine-tuned model organisms, and real evaluation runs. It finds that probe performance deteriorates as model scale and task complexity increase.

Key Result

White-box probes achieved 97-99% AUROC on simple tasks with small models, but performance deteriorated significantly on larger models (70B) and complex agentic tasks, and calibration challenges made practical deployment difficult.
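
To make the gap between AUROC and calibration concrete, here is a minimal sketch in plain Python. It is not the authors' code, and the scores, labels, and threshold are invented for illustration. AUROC depends only on how a probe ranks examples, so it can stay high while a fixed decision threshold stops working once the scores shift, which is one generic way a calibration problem can show up.

```python
def auroc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def false_positive_rate(scores, labels, threshold):
    """Fraction of negative (non-sandbagging) examples flagged at this threshold."""
    neg = [s for s, y in zip(scores, labels) if y == 0]
    return sum(s >= threshold for s in neg) / len(neg)


# Hypothetical probe scores, made up for this example (label 1 = sandbagging).
labels = [0, 0, 0, 0, 1, 1, 1, 1]
scores_a = [0.1, 0.2, 0.3, 0.6, 0.5, 0.7, 0.8, 0.9]

# Same ranking, but every score shifted upward, as an assumed stand-in
# for a change of model or task distribution.
scores_b = [s + 0.4 for s in scores_a]

# Hypothetical threshold chosen on the first setting.
threshold = 0.45

for name, scores in [("setting A", scores_a), ("setting B", scores_b)]:
    print(name,
          "AUROC:", auroc(scores, labels),
          "FPR at threshold:", false_positive_rate(scores, labels, threshold))
```

Both settings give the same AUROC because the ordering of the scores is unchanged, while the false positive rate at the fixed threshold differs sharply. A ranking metric alone therefore does not say whether a probe can be deployed with a pre-set threshold.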

Source

Link: https://www.alignmentforum.org/posts/pPEeMdgjpjHZWCDFw/white-box-control-at-uk-aisi-update-on-sandbagging
Listed in the Shallow Review of Technical AI Safety 2025 under 3 agendas:
- lie-and-deception-detectors — White-box safety (i.e. Interpretability)
- monitoring-concepts — White-box safety (i.e. Interpretability) / Concept-based interpretability
- sandbagging-evals — Evals
Editorial blurb (verbatim): [White Box Control at UK AISI \- Update on Sandbagging Investigations](https://www.alignmentforum.org/posts/pPEeMdgjpjHZWCDFw/white-box-control-at-uk-aisi-update-on-sandbagging)
