Steering Evaluation-Aware Language Models to Act Like They Are Deployed
Tim Tian Hua, Andrew Qin, Samuel Marks, Neel Nanda — 2025-10-23 — arXiv
Summary
Demonstrates that activation steering with steering vectors can suppress evaluation awareness in language models, making them behave as if deployed even during evaluation. The authors train models to exhibit evaluation-aware behavior, then show that steering eliminates this behavior, improving evaluation reliability.
Key Result
Activation steering successfully suppressed evaluation awareness, making models write Python type hints at deployment rates even when evaluation cues were present, using steering vectors constructed from the original model before additional training.
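A common way to build such steering vectors is a contrastive difference of mean activations between two sets of prompts, which is then added to the model's hidden states at inference time. The following is a minimal illustrative NumPy sketch of that general recipe, not the paper's code; all names, dimensions, and data are hypothetical stand-ins (real activations would come from a model's residual stream):

```python
import numpy as np

# Hypothetical stand-in activations at one layer for "deployment-like"
# and "evaluation-like" prompts (in practice, extracted from the model).
rng = np.random.default_rng(0)
d_model = 8
deploy_acts = rng.normal(loc=1.0, size=(16, d_model))
eval_acts = rng.normal(loc=-1.0, size=(16, d_model))

# Steering vector: difference of the two mean activations.
steering_vec = deploy_acts.mean(axis=0) - eval_acts.mean(axis=0)

def steer(hidden, vec, alpha=1.0):
    """Add the scaled steering vector to each hidden state."""
    return hidden + alpha * vec

# Nudge evaluation-like activations toward the deployment direction.
steered = steer(eval_acts, steering_vec, alpha=1.0)
```

In a real setup, `steer` would run inside a forward hook on a chosen transformer layer, with `alpha` tuned so the steered behavior matches deployment without degrading capabilities.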
Source
- Link: https://arxiv.org/abs/2510.20487
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- activation-engineering — White-box safety (i.e. Interpretability) / Concept-based interpretability