When Truthful Representations Flip Under Deceptive Instructions?
Xianxuan Long, Yao Fu, Runchao Li, Mu Sheng, Haotian Yu, Xiaotian Han, … (+1 more) — 2025-07-29 — arXiv
Summary
Investigates how deceptive instructions alter internal representations in LLMs, using linear probes and Sparse Autoencoders to identify layer-wise and feature-level signatures that separate truthful from deceptive representations in Llama-3.1 and Gemma-2 models.
Key Result
Deceptive instructions induce significant representational shifts concentrated in early-to-mid layers, creating distinct truthful/deceptive subspaces detectable via SAE features, enabling new approaches for detecting and mitigating instructed dishonesty.
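The linear-probing setup the paper uses can be sketched on synthetic data: a probe is trained on hidden states from one layer to classify whether the prompt was truthful or deceptive. This is a minimal illustration, not the paper's implementation; the activations, dimensions, and mean-shift model of the "representational shift" below are all invented stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64   # toy hidden size (real residual streams are much larger)
n = 200  # examples per condition

# Synthetic layer activations: deceptive prompts shift the mean along one
# fixed direction, a crude stand-in for the shift the paper reports in
# early-to-mid layers.
shift = rng.normal(size=d)
truthful = rng.normal(size=(n, d))
deceptive = rng.normal(size=(n, d)) + 2.0 * shift

X = np.vstack([truthful, deceptive])
y = np.concatenate([np.zeros(n), np.ones(n)])  # 0 = truthful, 1 = deceptive

# Logistic-regression probe trained by plain gradient descent.
w, b = np.zeros(d), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(deceptive)
    w -= 0.5 * (X.T @ (p - y)) / len(y)
    b -= 0.5 * np.mean(p - y)

acc = np.mean(((1.0 / (1.0 + np.exp(-(X @ w + b)))) > 0.5) == y)
print(f"probe accuracy: {acc:.2f}")
```

If the two conditions occupy distinct subspaces at a layer, a probe like this separates them well there; sweeping the probe across layers is what yields the layer-wise signature described above.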
Source
- Link: https://arxiv.org/abs/2507.22149
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- extracting-latent-knowledge — White-box safety (i.e. Interpretability)