Probe-Rewrite-Evaluate: A Workflow for Reliable Benchmarks and Quantifying Evaluation Awareness

Lang Xiong, Nishant Bhargava, Jianhang Hong, Jeremy Chang, Haihao Liu, Vasu Sharma, … (+1 more) — 2025-08-30 — arXiv

Summary

Introduces a probe-rewrite-evaluate methodology for quantifying and manipulating evaluation awareness in LLMs: linear probes score prompts along a test-to-deploy spectrum, and an LLM then rewrites the prompts to appear more deployment-like while preserving the underlying task.
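The probe step can be illustrated with a minimal sketch. This is not the paper's implementation: the activations are simulated with NumPy rather than extracted from a real model, and the cluster means, dimensions, and training loop are all illustrative assumptions. It only shows the general idea of a linear probe assigning each prompt a score on a test-to-deploy spectrum.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: in the paper, probes read LLM activations; here we
# simulate two activation clusters, one "test-like" and one "deploy-like".
d = 32
test_mean = rng.normal(0, 1, d)   # simulated "evaluation-like" cluster center
deploy_mean = test_mean + 0.8     # shifted center for "deployment-like" prompts

X_test = rng.normal(test_mean, 1.0, (200, d))
X_deploy = rng.normal(deploy_mean, 1.0, (200, d))
X = np.vstack([X_test, X_deploy])
y = np.concatenate([np.zeros(200), np.ones(200)])  # 0 = test-like, 1 = deploy-like

# Linear probe: logistic regression fit by plain gradient descent.
w = np.zeros(d)
b = 0.0
for _ in range(500):
    p = 1 / (1 + np.exp(-(X @ w + b)))
    w -= 0.5 * (X.T @ (p - y)) / len(y)
    b -= 0.5 * np.mean(p - y)

def deploy_score(activation):
    """Probe score in [0, 1]; higher means more deployment-like."""
    return float(1 / (1 + np.exp(-(activation @ w + b))))

# A test-like prompt should score lower than its deploy-like rewrite.
before = deploy_score(rng.normal(test_mean, 1.0, d))
after = deploy_score(rng.normal(deploy_mean, 1.0, d))
print(before < after)
```

In the actual workflow the rewrite step would then be judged by this same probe: a successful rewrite moves the prompt's score toward the deploy end while an auxiliary check confirms the task is unchanged.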

Key Result

Rewritten deploy-like prompts induced significant behavioral shifts across models: 5.26% increase in honest responses, 12.40% decrease in deceptive responses, and 6.38% increase in refusal rates.

Source