Large Language Models Often Know When They Are Being Evaluated

Joe Needham, Giles Edkins, Govind Pimpale, Henning Bartsch, Marius Hobbhahn — 2025-05-28 — arXiv

Summary

Tests whether frontier language models can detect when they are being evaluated. The authors construct a benchmark of 1,000 prompts drawn from 61 datasets spanning evaluation and deployment contexts, and measure models’ ability to classify transcripts as originating from evaluations versus real-world use.

Key Result

Gemini-2.5-Pro achieves an AUC of 0.83 in distinguishing evaluation transcripts from deployment transcripts, demonstrating substantial but not yet superhuman evaluation awareness that could compromise evaluation reliability.
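For context on the headline metric: AUC can be read as the probability that a randomly chosen evaluation transcript is assigned a higher "this is an evaluation" score than a randomly chosen deployment transcript. A minimal sketch of that computation, with purely illustrative labels and scores (not the paper's data):

```python
def auc(labels, scores):
    """AUC via the Mann-Whitney U statistic: the probability that a
    randomly chosen positive (evaluation) transcript outscores a
    randomly chosen negative (deployment) one; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# 1 = evaluation transcript, 0 = real-world deployment transcript
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]  # hypothetical P(evaluation)
print(round(auc(labels, scores), 3))
```

An AUC of 0.5 corresponds to chance-level discrimination, so 0.83 indicates a sizeable, though imperfect, signal.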

Source