Large Language Models Often Know When They Are Being Evaluated
Joe Needham, Giles Edkins, Govind Pimpale, Henning Bartsch, Marius Hobbhahn — 2025-05-28 — arXiv
Summary
Tests whether frontier language models can detect when they are being evaluated. The authors construct a benchmark of 1,000 prompts drawn from 61 datasets spanning both evaluation and deployment contexts, and measure how well models classify transcripts as originating from evaluations versus real-world use.
Key Result
Gemini-2.5-Pro achieves AUC of 0.83 in distinguishing evaluation from deployment transcripts, demonstrating substantial but not yet superhuman evaluation awareness that could compromise evaluation reliability.
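The AUC reported here is the standard area under the ROC curve: the probability that a randomly chosen evaluation transcript receives a higher "this is an evaluation" score than a randomly chosen deployment transcript. As an illustrative sketch (not the paper's code, and with made-up scores), it can be computed directly from pairwise comparisons:

```python
def auc(eval_scores, deploy_scores):
    """AUC via the Mann-Whitney pairwise interpretation:
    fraction of (evaluation, deployment) pairs where the
    evaluation transcript is scored higher (ties count half)."""
    wins = 0.0
    for e in eval_scores:
        for d in deploy_scores:
            if e > d:
                wins += 1.0
            elif e == d:
                wins += 0.5
    return wins / (len(eval_scores) * len(deploy_scores))

# Hypothetical model scores for illustration only.
eval_scores = [0.9, 0.8, 0.6, 0.7]    # transcripts from evaluations
deploy_scores = [0.2, 0.5, 0.4, 0.65]  # transcripts from deployment

print(auc(eval_scores, deploy_scores))
```

An AUC of 0.5 corresponds to chance-level discrimination and 1.0 to perfect separation, so 0.83 indicates substantial but imperfect evaluation awareness.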
Source
- Link: https://arxiv.org/abs/2505.23836
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda.