Difficulties with Evaluating a Deception Detector for AIs
Lewis Smith, Bilal Chughtai, Neel Nanda — 2025-11-27 — Google — arXiv
Summary
Identifies fundamental obstacles in constructing evaluations for AI deception detectors, arguing that distinguishing strategic deception from simpler behaviors requires attributing beliefs and goals to models, which is difficult or underdetermined for current language models.
Source
- Link: https://arxiv.org/html/2511.22662v1
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- google-deepmind — Labs (giant companies)