It’s hard to make scheming evals look realistic for LLMs

Igor Ivanov, Danil Kadochnikov — 2025-05-24 — LessWrong

Summary

Empirical study testing whether scheming evaluation scenarios can be made more realistic to LLMs, finding that Claude 3.7 Sonnet consistently detects evaluation contexts (P=0.9-0.98) even after systematic rewriting attempts using automated pipeline to reduce evaluation-signaling features.

Key Result

Multiple rewriting strategies (doubling length, changing goals midpoint, dampening salience) only reduced Claude’s evaluation detection probability from 0.90-0.98 to 0.80-0.90, demonstrating fundamental difficulty in disguising evaluation contexts.

Source