It’s hard to make scheming evals look realistic for LLMs

Igor Ivanov, Danil Kadochnikov — 2025-05-24 — LessWrong

Summary

Empirical study testing whether scheming evaluation scenarios can be made more realistic to LLMs, finding that Claude 3.7 Sonnet consistently detects evaluation contexts (P=0.9-0.98) even after systematic rewriting attempts using automated pipeline to reduce evaluation-signaling features.

Key Result

Multiple rewriting strategies (doubling length, changing goals midpoint, dampening salience) only reduced Claude’s evaluation detection probability from 0.90-0.98 to 0.80-0.90, demonstrating fundamental difficulty in disguising evaluation contexts.

Source

Link: https://lesswrong.com/posts/TBk2dbWkg2F7dB3jb/it-s-hard-to-make-scheming-evals-look-realistic-for-llms
Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda(s):
- situational-awareness-and-self-awareness-evals — Evals

situational-awareness-and-self-awareness-evals

AI Safety Compendium

Explorer

It's hard to make scheming evals look realistic for LLMs

It’s hard to make scheming evals look realistic for LLMs

Summary

Key Result

Source

Graph View

Graph view

Table of Contents

AI Safety Compendium

Explorer

It's hard to make scheming evals look realistic for LLMs

It’s hard to make scheming evals look realistic for LLMs

Summary

Key Result

Source

Related Pages

Graph View

Graph view

Table of Contents