Steering Evaluation-Aware Language Models to Act Like They Are Deployed
Tim Tian Hua, Andrew Qin, Samuel Marks, Neel Nanda — 2025-10-23 — arXiv
Summary
Demonstrates that activation steering with steering vectors can suppress evaluation awareness in language models, making them behave as if deployed even during evaluation. The authors train models to exhibit evaluation-aware behavior, then show that steering eliminates this behavior, improving evaluation reliability.
Key Result
Activation steering successfully suppressed evaluation awareness, making models write Python type hints at deployment rates even when evaluation cues were present, using steering vectors constructed from the original model before additional training.
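A common way to build such steering vectors is a contrastive difference of mean activations between two sets of prompts, which is then added to the model's hidden states at inference time. The following is a minimal illustrative NumPy sketch of that general recipe, not the paper's code; all names, dimensions, and data are hypothetical stand-ins (real activations would come from a model's residual stream):

```python
import numpy as np

# Hypothetical stand-in activations at one layer for "deployment-like"
# and "evaluation-like" prompts (in practice, extracted from the model).
rng = np.random.default_rng(0)
d_model = 8
deploy_acts = rng.normal(loc=1.0, size=(16, d_model))
eval_acts = rng.normal(loc=-1.0, size=(16, d_model))

# Steering vector: difference of the two mean activations.
steering_vec = deploy_acts.mean(axis=0) - eval_acts.mean(axis=0)

def steer(hidden, vec, alpha=1.0):
    """Add the scaled steering vector to each hidden state."""
    return hidden + alpha * vec

# Nudge evaluation-like activations toward the deployment direction.
steered = steer(eval_acts, steering_vec, alpha=1.0)
```

In a real setup, `steer` would run inside a forward hook on a chosen transformer layer, with `alpha` tuned so the steered behavior matches deployment without degrading capabilities.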
Source
- Link: https://arxiv.org/abs/2510.20487
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- activation-engineering — White-box safety (i.e. Interpretability) / Concept-based interpretability