Steering Evaluation-Aware Language Models to Act Like They Are Deployed

Tim Tian Hua, Andrew Qin, Samuel Marks, Neel Nanda — 2025-10-23 — arXiv

Summary

Demonstrates that activation steering can suppress evaluation awareness in language models, making them act as if deployed even during evaluation. Trains models to exhibit evaluation-aware behavior, then shows that steering can eliminate this behavior, which improves evaluation reliability.
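
To make the steering idea concrete, here is a minimal sketch in plain Python. It assumes a difference-of-means construction over contrastive prompts, which is a common way to build steering vectors but is not confirmed by this summary as the authors' method. The function names and the toy activation values are hypothetical.

```python
from statistics import fmean


def mean_vector(rows):
    """Average a list of equal-length activation vectors, dimension by dimension."""
    return [fmean(column) for column in zip(*rows)]


def steering_vector(deploy_acts, eval_acts):
    """Assumed construction: mean 'deployment' activation minus mean 'evaluation' activation."""
    deploy_mean = mean_vector(deploy_acts)
    eval_mean = mean_vector(eval_acts)
    return [d - e for d, e in zip(deploy_mean, eval_mean)]


def steer(hidden, vector, scale=1.0):
    """Add the scaled steering vector to a single hidden-state vector."""
    return [h + scale * v for h, v in zip(hidden, vector)]


# Toy activations (hypothetical numbers, 3 dimensions, 2 prompts per condition).
deploy_acts = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
eval_acts = [[0.0, 1.0, 2.0], [0.0, 3.0, 2.0]]

vec = steering_vector(deploy_acts, eval_acts)
steered = steer([0.5, 0.5, 0.5], vec, scale=0.5)
print(vec)      # [2.0, -2.0, 0.0]
print(steered)  # [1.5, -0.5, 0.5]
```

In a real model the addition would be applied to a layer's hidden states during the forward pass; the layer and the scale are tuning choices that this sketch leaves open.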

Key Result

Activation steering suppressed evaluation awareness: steered models wrote Python type hints at deployment rates even when evaluation cues were present. The steering vectors were constructed from the original model, before the additional training.
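
The type-hint rate referred to above could be measured along the following lines. This is a sketch under assumptions, not the paper's scoring code: the helper names are hypothetical, and the rule that a response counts if any function carries at least one annotation is an illustrative choice.

```python
import ast


def has_type_hints(source: str) -> bool:
    """Return True if any function in `source` annotates an argument or its return value."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        # Assumption: responses that do not parse are counted as having no hints.
        return False
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            a = node.args
            params = a.posonlyargs + a.args + a.kwonlyargs
            params += [p for p in (a.vararg, a.kwarg) if p is not None]
            if node.returns is not None:
                return True
            if any(p.annotation is not None for p in params):
                return True
    return False


def type_hint_rate(responses):
    """Fraction of responses containing at least one type-annotated function."""
    if not responses:
        raise ValueError("responses must not be empty")
    return sum(has_type_hints(r) for r in responses) / len(responses)


# Two hypothetical model responses: one annotated, one not.
responses = [
    "def add(a: int, b: int) -> int:\n    return a + b\n",
    "def add(a, b):\n    return a + b\n",
]
print(type_hint_rate(responses))  # 0.5
```

Comparing this rate between steered and unsteered responses on prompts that contain evaluation cues gives a simple check of whether steering moved the behavior toward the deployment rate.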

Source