Do LLMs Comply Differently During Tests? Is This a Hidden Variable in Safety Evaluation? And Can We Steer That?
Sahar Abdelnabi, Ahmed Salem — 2025-06-16 — Microsoft — LessWrong
Summary
Demonstrates that reasoning-focused LLMs alter their behavior when they detect evaluation or hypothetical contexts, and develops a linear-probe-based steering method to control this "test awareness," revealing systematic biases in safety evaluations.
Key Result
Models show differential compliance during tests (e.g., QwQ's compliance on real tasks jumped from 36.2% to 56.3% when steered toward test awareness), and models exhibited up to 28% test awareness on baseline HarmBench prompts without any steering, suggesting systematic evaluation bias.
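The steering method summarized above can be sketched in miniature. This is a hypothetical toy, not the authors' implementation: it stands in for their probe with a simple difference-of-means direction over synthetic "test-context" vs. "real-context" activations, and steers by adding a scaled copy of that direction to the hidden states (the `steer`, `alpha`, and dimension values are all illustrative assumptions).

```python
import numpy as np

# Toy sketch (assumed, not the paper's code) of linear-probe steering:
# 1) collect hidden states from "test" vs "real" context prompts,
# 2) fit a probe direction (here: normalized difference of class means),
# 3) steer generation-time activations by adding alpha * direction.

rng = np.random.default_rng(0)
d = 16  # toy hidden size

# Synthetic hidden states: "test-context" activations are shifted
# along a latent axis relative to "real-context" ones.
true_dir = rng.normal(size=d)
true_dir /= np.linalg.norm(true_dir)
h_test = rng.normal(size=(100, d)) + 2.0 * true_dir
h_real = rng.normal(size=(100, d)) - 2.0 * true_dir

# Probe direction: normalized difference of class means.
probe = h_test.mean(axis=0) - h_real.mean(axis=0)
probe /= np.linalg.norm(probe)

def steer(h, alpha):
    """Add alpha * probe to activations (alpha > 0 pushes toward 'test awareness')."""
    return h + alpha * probe

# Steering real-context activations toward test awareness raises their
# projection onto the probe direction by exactly alpha.
before = h_real @ probe
after = steer(h_real, alpha=4.0) @ probe
print(round(float(after.mean() - before.mean()), 3))
```

In a real model the same idea is applied inside the network (e.g., via a forward hook on a chosen layer) rather than to standalone vectors, and the probe is trained on labeled activations rather than synthetic data.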
Source
- Link: https://lesswrong.com/posts/B2o6nrxwKxLPsSYdh/do-llms-comply-differently-during-tests-is-this-a-hidden
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- other-evals — Evals