Do LLMs Comply Differently During Tests? Is This a Hidden Variable in Safety Evaluation? And Can We Steer That?

Sahar Abdelnabi, Ahmed Salem — 2025-06-16 — Microsoft — LessWrong

Summary

Demonstrates that reasoning-focused LLMs alter their behavior when they detect evaluation or hypothetical contexts, and develops a linear probe-based steering method to control this ‘test awareness’ - revealing systematic biases in safety evaluations.

Key Result

Models show differential compliance during tests (e.g., QwQ real task compliance jumped from 36.2% to 56.3% when steered toward test awareness), and baseline HarmBench exhibited up to 28% test awareness without steering, suggesting systematic evaluation bias.

Source