White Box Control at UK AISI - Update on Sandbagging Investigations

Joseph Bloom, Jordan Taylor, Connor Kissane, Sid Black, Jacob Merizian, Alex Zelenka-Martin, … (+3 more) — 2025-07-10 — UK AISI — AI Alignment Forum

Summary

UK AISI tests black box and white box methods for detecting sandbagging (deliberate underperformance) across prompted models, fine-tuned model organisms, and real evaluation runs, finding that probe performance deteriorates with model scale and task complexity.

Key Result

White box probes achieved 97-99% AUROC on simple tasks with small models, but performance deteriorated significantly on larger models (70B) and on complex agentic tasks. Calibration challenges also made practical deployment difficult.
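To illustrate why a high AUROC does not by itself settle the calibration question, here is a minimal sketch in plain Python. It is not the authors' code: the function names, the scores, and the threshold are all hypothetical, and the data is synthetic. AUROC depends only on how the probe ranks examples, so a uniform shift in scores (a crude stand-in for distribution shift) leaves it unchanged while a fixed alert threshold starts flagging benign runs.

```python
def auroc(scores, labels):
    """Probability that a random positive outscores a random negative.

    Ties count as 0.5. This is the pairwise (Mann-Whitney) form of AUROC.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def false_positive_rate(scores, labels, threshold):
    """Fraction of negative (non-sandbagging) examples flagged at `threshold`."""
    neg = [s for s, y in zip(scores, labels) if y == 0]
    return sum(s >= threshold for s in neg) / len(neg)


# Synthetic probe scores, for illustration only (not data from the post).
# Label 1 = sandbagging, label 0 = honest effort.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
scores_a = [0.10, 0.20, 0.30, 0.45, 0.40, 0.70, 0.80, 0.90]

# Same ranking, but every score shifted upward by a constant.
scores_b = [s + 0.3 for s in scores_a]

# Hypothetical alert threshold chosen on the first distribution.
threshold = 0.55

print("AUROC a:", auroc(scores_a, labels))
print("AUROC b:", auroc(scores_b, labels))
print("FPR a:", false_positive_rate(scores_a, labels, threshold))
print("FPR b:", false_positive_rate(scores_b, labels, threshold))
```

In this toy data the AUROC is identical for both score sets, while the false positive rate at the fixed threshold rises from 0 to 0.5. That gap is one way a probe can look strong on a ranking metric yet be hard to deploy with a fixed decision threshold.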

Source