It’s hard to make scheming evals look realistic for LLMs
Igor Ivanov, Danil Kadochnikov — 2025-05-24 — LessWrong
Summary
Empirical study testing whether scheming evaluation scenarios can be made more realistic to LLMs, finding that Claude 3.7 Sonnet consistently detects evaluation contexts (P=0.9-0.98) even after systematic rewriting attempts using automated pipeline to reduce evaluation-signaling features.
Key Result
Multiple rewriting strategies (doubling length, changing goals midpoint, dampening salience) only reduced Claude’s evaluation detection probability from 0.90-0.98 to 0.80-0.90, demonstrating fundamental difficulty in disguising evaluation contexts.
Source
- Link: https://lesswrong.com/posts/TBk2dbWkg2F7dB3jb/it-s-hard-to-make-scheming-evals-look-realistic-for-llms
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda(s):