Benchmarking deception probes for trusted monitoring
Avi Parrack, StefanHex, Cleo Nardo — 2024-07-23 — Stanford University
Source
- Link: https://www.lesswrong.com/posts/eaEqAzGN3uJfpfGoc/trusted-monitoring-but-with-deception-probes
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
  - lie-and-deception-detectors — White-box safety (i.e. Interpretability)