AI Safety Compendium

Whitebox detection of sandbagging model organisms

27 Apr 2026

Joseph Bloom, Jordan Taylor, Connor Kissane, Sid Black, Jacob Merizian, Alex Zelenka-Martin, … (+3 more) — 2025-07-10 — UK AISI

Source

  • Link: https://www.lesswrong.com/posts/pPEeMdgjpjHZWCDFw/white-box-control-at-uk-aisi-update-on-sandbagging
  • Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
    • lie-and-deception-detectors — White-box safety (i.e. Interpretability)

Related Pages

  • lie-and-deception-detectors
