White Box Control at UK AISI - Update on Sandbagging Investigations
Joseph Bloom, Jordan Taylor, Connor Kissane, Sid Black, Jacob Merizian, Alex Zelenka-Martin, … (+3 more) — 2025-07-10 — UK AISI — AI Alignment Forum
Summary
UK AISI tests black-box and white-box methods for detecting sandbagging (deliberate underperformance) in prompted models, fine-tuned model organisms, and real evaluation runs. Probe performance deteriorates as model scale and task complexity increase.
Key Result
White-box probes achieved 97-99% AUROC on simple tasks with small models, but performance deteriorated significantly on larger models (70B) and on complex agentic tasks, and calibration challenges made practical deployment difficult.
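To make the distinction between AUROC and calibration concrete, here is a minimal sketch in Python. It is not the authors' code or method. It uses synthetic vectors as a hypothetical stand-in for model activations, a simple difference-of-means probe, and an assumed uniform offset to mimic a distribution shift. The function names, dimensions, shift sizes, and the 1% target false positive rate are all illustrative assumptions. The point is only that a probe can keep a high AUROC on shifted data while a decision threshold chosen on the original data no longer gives the intended false positive rate.

```python
import numpy as np

def auroc(scores, labels):
    """Rank-based AUROC: chance that a random positive outscores a random negative."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    # Compare every positive with every negative; ties count as half.
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (len(pos) * len(neg)))

def fit_mean_diff_probe(acts, labels):
    """Difference-of-means probe: unit vector from the honest mean to the sandbagging mean."""
    labels = np.asarray(labels, dtype=bool)
    direction = acts[labels].mean(axis=0) - acts[~labels].mean(axis=0)
    return direction / np.linalg.norm(direction)

def make_synthetic_acts(rng, n, dim, shift, offset=0.0):
    """Hypothetical stand-in for activations (synthetic, not real model data)."""
    labels = rng.integers(0, 2, size=n).astype(bool)
    acts = rng.normal(size=(n, dim)) + offset
    acts[labels, 0] += shift  # assumed: the sandbagging signal lies along one axis
    return acts, labels

rng = np.random.default_rng(0)

# Fit the probe on the "simple" distribution.
train_acts, train_labels = make_synthetic_acts(rng, 2000, 16, shift=4.0)
probe = fit_mean_diff_probe(train_acts, train_labels)

# Choose a threshold targeting about 1% false positives on honest training examples.
threshold = np.quantile(train_acts[~train_labels] @ probe, 0.99)

# Evaluate on a shifted distribution: same signal, but every activation is offset.
test_acts, test_labels = make_synthetic_acts(rng, 2000, 16, shift=4.0, offset=1.0)
test_scores = test_acts @ probe

test_auroc = auroc(test_scores, test_labels)
test_fpr = float((test_scores[~test_labels] > threshold).mean())
print(f"AUROC on shifted data: {test_auroc:.3f}")
print(f"False positive rate at the original threshold: {test_fpr:.3f}")
```

AUROC depends only on how the scores rank positives against negatives, so a uniform shift in scores leaves it unchanged, whereas a fixed threshold is sensitive to exactly that kind of shift. This is one plausible reading of why calibration can be a separate obstacle from ranking quality; the post itself should be consulted for the specific calibration issues the authors observed.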
Source
- Link: https://www.alignmentforum.org/posts/pPEeMdgjpjHZWCDFw/white-box-control-at-uk-aisi-update-on-sandbagging
- Listed in the Shallow Review of Technical AI Safety 2025 under 3 agendas:
- lie-and-deception-detectors — White-box safety (i.e. Interpretability)
- monitoring-concepts — White-box safety (i.e. Interpretability) / Concept-based interpretability
- sandbagging-evals — Evals
- Editorial blurb (verbatim):
[White Box Control at UK AISI \- Update on Sandbagging Investigations](https://www.alignmentforum.org/posts/pPEeMdgjpjHZWCDFw/white-box-control-at-uk-aisi-update-on-sandbagging)