Whitebox detection of sandbagging model organisms
Joseph Bloom, Jordan Taylor, Connor Kissane, Sid Black, Jacob Merizian, Alex Zelenka-Martin, … (+3 more) — 2025-07-10 — UK AISI
Source
- Link: https://www.lesswrong.com/posts/pPEeMdgjpjHZWCDFw/white-box-control-at-uk-aisi-update-on-sandbagging
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- lie-and-deception-detectors — White-box safety (i.e. Interpretability)