When Thinking LLMs Lie: Unveiling the Strategic Deception in Representations of Reasoning Models
Kai Wang, Yihao Zhang, Meng Sun — 2025-06-05 — arXiv
Summary
Uses representation engineering to systematically induce, detect, and control strategic deception in chain-of-thought reasoning models. "Deception vectors" extracted via Linear Artificial Tomography (LAT) yield 89% detection accuracy, and activation steering with these vectors elicits context-appropriate deception with a 40% success rate.
Key Result
Achieved 89% detection accuracy for strategic deception using LAT-extracted deception vectors and 40% success rate in eliciting context-appropriate deception through activation steering.
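The pipeline described above can be sketched roughly: collect hidden-state activations under contrasting honest/deceptive stimuli, extract a reading vector from the paired activation differences, then use projections onto that vector for detection and add a scaled copy of it for steering. The sketch below is a minimal, hedged illustration with NumPy on synthetic activations; the top-singular-direction extraction stands in for the paper's LAT procedure, and all function names, thresholds, and the steering coefficient `alpha` are assumptions, not the authors' implementation.

```python
import numpy as np

def extract_deception_vector(honest_acts, deceptive_acts):
    """LAT-style reading vector (sketch): the dominant singular
    direction of paired activation differences, oriented so that
    deceptive activations project positively."""
    diffs = deceptive_acts - honest_acts          # (n, d) paired differences
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    v = vt[0]                                     # top right-singular vector
    if v @ diffs.mean(axis=0) < 0:                # fix sign ambiguity of SVD
        v = -v
    return v / np.linalg.norm(v)

def detect(activation, vector, threshold=0.0):
    """Classify an activation as 'deceptive' by its projection
    onto the deception vector (threshold is an assumption)."""
    return float(activation @ vector) > threshold

def steer(activation, vector, alpha=3.0):
    """Activation steering (sketch): shift a hidden state along the
    deception direction; alpha is a hypothetical strength."""
    return activation + alpha * vector
```

On synthetic data where the deceptive class is offset along a known direction, `extract_deception_vector` recovers that direction and the projection cleanly separates the two classes; applying the same ideas to a real model would require hooking intermediate-layer activations, which this sketch deliberately omits.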
Source
- Link: https://arxiv.org/abs/2506.04909
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- extracting-latent-knowledge — White-box safety (i.e. Interpretability)