Probe-Rewrite-Evaluate: A Workflow for Reliable Benchmarks and Quantifying Evaluation Awareness
Lang Xiong, Nishant Bhargava, Jianhang Hong, Jeremy Chang, Haihao Liu, Vasu Sharma, … (+1 more) — 2025-08-30 — arXiv
Summary
Introduces a probe-rewrite-evaluate workflow to quantify and manipulate evaluation awareness in LLMs: linear probes score prompts along a test-to-deploy spectrum, and LLM-based rewriting shifts prompts toward deployment-like context while preserving the underlying task. A minimal sketch of the probe step appears below.
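The sketch below is not the authors' code; it illustrates, under simplifying assumptions, the probing step: fit a linear probe (here, logistic regression) on hidden-state activations labeled evaluation-like vs. deployment-like, then use its probability output as a test-to-deploy score for new prompts. The activations are synthetic stand-ins; in practice they would be extracted from a fixed layer of the model under study, and the dimension `d` and helper `test_awareness_score` are hypothetical.

```python
# Minimal sketch of a linear probe scoring prompts on a test-to-deploy axis.
# Synthetic activations stand in for real hidden states from an LLM layer.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d = 256  # hypothetical hidden-state dimension

# Labels: 1 = "evaluation-like" prompt, 0 = "deployment-like" prompt.
X_eval = rng.normal(loc=0.5, scale=1.0, size=(100, d))
X_deploy = rng.normal(loc=-0.5, scale=1.0, size=(100, d))
X = np.vstack([X_eval, X_deploy])
y = np.concatenate([np.ones(100), np.zeros(100)])

# Fit the linear probe on the labeled activations.
probe = LogisticRegression(max_iter=1000).fit(X, y)

def test_awareness_score(activation: np.ndarray) -> float:
    """Probability the probe assigns to 'evaluation-like' for one prompt's activation."""
    return float(probe.predict_proba(activation.reshape(1, -1))[0, 1])

# Score a new prompt's activation; lower scores read as more deployment-like.
print(test_awareness_score(rng.normal(size=d)))
```

In the workflow, prompts scoring high on this axis would be candidates for LLM rewriting toward deployment-like phrasing before re-evaluation.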
Key Result
Rewritten deploy-like prompts induced significant behavioral shifts across models: 5.26% increase in honest responses, 12.40% decrease in deceptive responses, and 6.38% increase in refusal rates.
Source
- Link: https://arxiv.org/abs/2509.00591
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda.