When Thinking LLMs Lie: Unveiling the Strategic Deception in Representations of Reasoning Models
Kai Wang, Yihao Zhang, Meng Sun — 2025-06-05 — arXiv
Summary
Uses representation engineering to systematically induce, detect, and control strategic deception in chain-of-thought reasoning models. "Deception vectors" extracted via Linear Artificial Tomography (LAT) yield 89% detection accuracy, and activation steering with these vectors elicits context-appropriate deception with a 40% success rate.
Key Result
Achieved 89% detection accuracy for strategic deception using LAT-extracted deception vectors and 40% success rate in eliciting context-appropriate deception through activation steering.
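The pipeline described above can be sketched roughly: collect hidden-state activations under contrasting honest/deceptive stimuli, extract a reading vector from the paired activation differences, then use projections onto that vector for detection and add a scaled copy of it for steering. The sketch below is a minimal, hedged illustration with NumPy on synthetic activations; the top-singular-direction extraction stands in for the paper's LAT procedure, and all function names, thresholds, and the steering coefficient `alpha` are assumptions, not the authors' implementation.

```python
import numpy as np

def extract_deception_vector(honest_acts, deceptive_acts):
    """LAT-style reading vector (sketch): the dominant singular
    direction of paired activation differences, oriented so that
    deceptive activations project positively."""
    diffs = deceptive_acts - honest_acts          # (n, d) paired differences
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    v = vt[0]                                     # top right-singular vector
    if v @ diffs.mean(axis=0) < 0:                # fix sign ambiguity of SVD
        v = -v
    return v / np.linalg.norm(v)

def detect(activation, vector, threshold=0.0):
    """Classify an activation as 'deceptive' by its projection
    onto the deception vector (threshold is an assumption)."""
    return float(activation @ vector) > threshold

def steer(activation, vector, alpha=3.0):
    """Activation steering (sketch): shift a hidden state along the
    deception direction; alpha is a hypothetical strength."""
    return activation + alpha * vector
```

On synthetic data where the deceptive class is offset along a known direction, `extract_deception_vector` recovers that direction and the projection cleanly separates the two classes; applying the same ideas to a real model would require hooking intermediate-layer activations, which this sketch deliberately omits.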
Source
- Link: https://arxiv.org/abs/2506.04909
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- extracting-latent-knowledge — White-box safety (i.e. Interpretability)