Difficulties with Evaluating a Deception Detector for AIs

Lewis Smith, Bilal Chughtai, Neel Nanda — 2025-11-27 — Google — arXiv

Summary

Identifies fundamental obstacles to constructing evaluations for AI deception detectors. Argues that distinguishing strategic deception from simpler behaviors requires attributing beliefs and goals to models, and that such attribution is difficult or underdetermined for current language models.

Source