Large Language Models Often Know When They Are Being Evaluated

Joe Needham, Giles Edkins, Govind Pimpale, Henning Bartsch, Marius Hobbhahn — 2025-05-28 — arXiv

Summary

Tests whether frontier language models can detect when they are being evaluated. The authors construct a benchmark of 1,000 prompts drawn from 61 datasets spanning evaluation and deployment contexts, and measure models’ ability to classify transcripts as originating from evaluations versus real-world use.

Key Result

Gemini-2.5-Pro achieves an AUC of 0.83 in distinguishing evaluation transcripts from deployment transcripts, demonstrating substantial but not yet superhuman evaluation awareness that could compromise evaluation reliability.
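For context on the headline metric: AUC can be read as the probability that a randomly chosen evaluation transcript is assigned a higher "this is an evaluation" score than a randomly chosen deployment transcript. A minimal sketch of that computation, with purely illustrative labels and scores (not the paper's data):

```python
def auc(labels, scores):
    """AUC via the Mann-Whitney U statistic: the probability that a
    randomly chosen positive (evaluation) transcript outscores a
    randomly chosen negative (deployment) one; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# 1 = evaluation transcript, 0 = real-world deployment transcript
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]  # hypothetical P(evaluation)
print(round(auc(labels, scores), 3))
```

An AUC of 0.5 corresponds to chance-level discrimination, so 0.83 indicates a sizeable, though imperfect, signal.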

Source