When Truthful Representations Flip Under Deceptive Instructions?

Xianxuan Long, Yao Fu, Runchao Li, Mu Sheng, Haotian Yu, Xiaotian Han, … (+1 more) — 2025-07-29 — arXiv

Summary

Uses linear probes and sparse autoencoders (SAEs) to investigate how deceptive instructions alter the internal representations of LLMs, identifying layer-wise and feature-level signatures that distinguish truthful from deceptive representations in Llama-3.1 and Gemma-2 models.
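
For readers unfamiliar with layer-wise linear probing, the following is a minimal sketch of the general idea, not the authors' implementation. It assumes a difference-of-means probe (one simple kind of linear probe, chosen here only to keep the example dependency-free) and runs on synthetic Gaussian vectors standing in for per-layer activations. All function names, the per-layer `shifts` values, and the data are hypothetical and carry no results from the paper.

```python
import random


def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    dim = len(rows[0])
    return [sum(r[i] for r in rows) / len(rows) for i in range(dim)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def fit_mean_difference_probe(truthful, deceptive):
    """Fit a linear probe: direction = mean(deceptive) - mean(truthful).

    The decision threshold is the projection of the midpoint between the
    two class means onto that direction.
    """
    mu_t = mean_vector(truthful)
    mu_d = mean_vector(deceptive)
    direction = [d - t for t, d in zip(mu_t, mu_d)]
    midpoint = [(t + d) / 2 for t, d in zip(mu_t, mu_d)]
    return direction, dot(direction, midpoint)


def probe_accuracy(probe, truthful, deceptive):
    """Fraction of held-out vectors the probe assigns to the right class."""
    direction, threshold = probe
    correct = sum(dot(direction, x) <= threshold for x in truthful)
    correct += sum(dot(direction, x) > threshold for x in deceptive)
    return correct / (len(truthful) + len(deceptive))


def synthetic_activations(n, dim, shift, rng):
    """Hypothetical stand-in for one layer's activations (NOT real model data)."""
    return [[rng.gauss(shift, 1.0) for _ in range(dim)] for _ in range(n)]


def layerwise_probe_accuracy(shifts, n_train=200, n_test=200, dim=16, seed=0):
    """Train and evaluate one probe per 'layer'.

    Each entry of `shifts` is an assumed amount by which the deceptive
    condition displaces that layer's activations.
    """
    rng = random.Random(seed)
    accuracies = []
    for shift in shifts:
        train_t = synthetic_activations(n_train, dim, 0.0, rng)
        train_d = synthetic_activations(n_train, dim, shift, rng)
        test_t = synthetic_activations(n_test, dim, 0.0, rng)
        test_d = synthetic_activations(n_test, dim, shift, rng)
        probe = fit_mean_difference_probe(train_t, train_d)
        accuracies.append(probe_accuracy(probe, test_t, test_d))
    return accuracies


if __name__ == "__main__":
    shifts = [0.0, 0.5, 1.0, 0.25]  # made-up per-layer separation
    for layer, acc in enumerate(layerwise_probe_accuracy(shifts)):
        print(f"layer {layer}: probe accuracy = {acc:.2f}")
```

In this toy setup, a layer with no shift should give roughly chance-level accuracy, while layers with a larger assumed shift should be easier to probe. With a real model, the synthetic vectors would be replaced by activations extracted at each layer under truthful and deceptive prompts.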

Key Result

Deceptive instructions induce significant representational shifts, concentrated in early-to-mid layers, that separate truthful and deceptive representations into distinct subspaces detectable via SAE features. These signatures point to new approaches for detecting and mitigating instructed dishonesty.

Source