Probe-Rewrite-Evaluate: A Workflow for Reliable Benchmarks and Quantifying Evaluation Awareness
Lang Xiong, Nishant Bhargava, Jianhang Hong, Jeremy Chang, Haihao Liu, Vasu Sharma, … (+1 more) — 2025-08-30 — arXiv
Summary
Introduces a probe-rewrite-evaluate workflow to quantify and manipulate evaluation awareness in LLMs: linear probes score prompts along a test-to-deploy spectrum, and LLM-based rewriting shifts prompts toward deployment-like context while preserving the underlying task. A minimal sketch of the probe step appears below.
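The sketch below is not the authors' code; it illustrates, under simplifying assumptions, the probing step: fit a linear probe (here, logistic regression) on hidden-state activations labeled evaluation-like vs. deployment-like, then use its probability output as a test-to-deploy score for new prompts. The activations are synthetic stand-ins; in practice they would be extracted from a fixed layer of the model under study, and the dimension `d` and helper `test_awareness_score` are hypothetical.

```python
# Minimal sketch of a linear probe scoring prompts on a test-to-deploy axis.
# Synthetic activations stand in for real hidden states from an LLM layer.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d = 256  # hypothetical hidden-state dimension

# Labels: 1 = "evaluation-like" prompt, 0 = "deployment-like" prompt.
X_eval = rng.normal(loc=0.5, scale=1.0, size=(100, d))
X_deploy = rng.normal(loc=-0.5, scale=1.0, size=(100, d))
X = np.vstack([X_eval, X_deploy])
y = np.concatenate([np.ones(100), np.zeros(100)])

# Fit the linear probe on the labeled activations.
probe = LogisticRegression(max_iter=1000).fit(X, y)

def test_awareness_score(activation: np.ndarray) -> float:
    """Probability the probe assigns to 'evaluation-like' for one prompt's activation."""
    return float(probe.predict_proba(activation.reshape(1, -1))[0, 1])

# Score a new prompt's activation; lower scores read as more deployment-like.
print(test_awareness_score(rng.normal(size=d)))
```

In the workflow, prompts scoring high on this axis would be candidates for LLM rewriting toward deployment-like phrasing before re-evaluation.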
Key Result
Rewritten deploy-like prompts induced significant behavioral shifts across models: 5.26% increase in honest responses, 12.40% decrease in deceptive responses, and 6.38% increase in refusal rates.
Source
- Link: https://arxiv.org/abs/2509.00591
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda.