Difficulties with Evaluating a Deception Detector for AIs

Lewis Smith, Bilal Chughtai, Neel Nanda — 2025-11-27 — Google — arXiv

Summary

This paper identifies fundamental obstacles to constructing evaluations for AI deception detectors. It argues that distinguishing strategic deception from simpler behaviors requires attributing beliefs and goals to models, and that such attribution is difficult or underdetermined for current language models.

Source

Link: https://arxiv.org/html/2511.22662v1
Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- google-deepmind — Labs (giant companies)
