Do LLMs Comply Differently During Tests? Is This a Hidden Variable in Safety Evaluation? And Can We Steer That?
Sahar Abdelnabi, Ahmed Salem — 2025-06-16 — Microsoft — LessWrong
Summary
Demonstrates that reasoning-focused LLMs alter their behavior when they detect evaluation or hypothetical contexts, and develops a linear-probe-based steering method to control this "test awareness," revealing systematic biases in safety evaluations.
Key Result
Models show differential compliance during tests (e.g., QwQ's compliance on real tasks jumped from 36.2% to 56.3% when steered toward test awareness), and models exhibited up to 28% test awareness on baseline HarmBench prompts without any steering, suggesting systematic evaluation bias.
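The steering method summarized above can be sketched in miniature. This is a hypothetical toy, not the authors' implementation: it stands in for their probe with a simple difference-of-means direction over synthetic "test-context" vs. "real-context" activations, and steers by adding a scaled copy of that direction to the hidden states (the `steer`, `alpha`, and dimension values are all illustrative assumptions).

```python
import numpy as np

# Toy sketch (assumed, not the paper's code) of linear-probe steering:
# 1) collect hidden states from "test" vs "real" context prompts,
# 2) fit a probe direction (here: normalized difference of class means),
# 3) steer generation-time activations by adding alpha * direction.

rng = np.random.default_rng(0)
d = 16  # toy hidden size

# Synthetic hidden states: "test-context" activations are shifted
# along a latent axis relative to "real-context" ones.
true_dir = rng.normal(size=d)
true_dir /= np.linalg.norm(true_dir)
h_test = rng.normal(size=(100, d)) + 2.0 * true_dir
h_real = rng.normal(size=(100, d)) - 2.0 * true_dir

# Probe direction: normalized difference of class means.
probe = h_test.mean(axis=0) - h_real.mean(axis=0)
probe /= np.linalg.norm(probe)

def steer(h, alpha):
    """Add alpha * probe to activations (alpha > 0 pushes toward 'test awareness')."""
    return h + alpha * probe

# Steering real-context activations toward test awareness raises their
# projection onto the probe direction by exactly alpha.
before = h_real @ probe
after = steer(h_real, alpha=4.0) @ probe
print(round(float(after.mean() - before.mean()), 3))
```

In a real model the same idea is applied inside the network (e.g., via a forward hook on a chosen layer) rather than to standalone vectors, and the probe is trained on labeled activations rather than synthetic data.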
Source
- Link: https://lesswrong.com/posts/B2o6nrxwKxLPsSYdh/do-llms-comply-differently-during-tests-is-this-a-hidden
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- other-evals — Evals