Caught in the Act: a mechanistic approach to detecting deception
Gerard Boxo, Ryan Socha, Daniel Yoo, Shivam Raval — 2025-08-27 — arXiv
Summary
Demonstrates that linear probes trained on LLMs’ internal activations can detect deceptive responses with over 90% accuracy across multiple models (Llama, Qwen, and DeepSeek variants from 1.5B to 14B parameters), and identifies multiple linear directions encoding deception via iterative null-space projection.
Key Result
Linear probes achieve >90% accuracy at detecting deception in larger reasoning models (>7B parameters); probe accuracy follows a three-stage pattern across depth, peaking in middle layers, and between 20 and 100 deception-encoding directions are identified depending on model size.
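The iterative null-space projection step can be sketched as follows. This is a minimal illustration in the style of INLP, not the authors' code: a linear probe is fit on activations, its direction is recorded, and that direction is projected out of the data before refitting; repeating this recovers multiple label-encoding directions. Synthetic Gaussian "activations" with two planted deception directions stand in for real LLM hidden states, and all dimensions and constants below are arbitrary choices for the demo.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d, n = 32, 2000

# Synthetic stand-in for hidden states: plant two label-correlated
# directions (axes 0 and 1) in an otherwise isotropic activation space.
labels = rng.integers(0, 2, n)
X = rng.normal(size=(n, d)).astype(float)
X[:, 0] += 3.0 * labels   # strong "deception" direction
X[:, 1] += 1.5 * labels   # weaker secondary direction

directions, accs = [], []
for _ in range(3):
    # Fit a linear probe on the current (projected) activations.
    probe = LogisticRegression(max_iter=1000).fit(X, labels)
    accs.append(probe.score(X, labels))

    # Record the probe's unit direction.
    w = probe.coef_[0] / np.linalg.norm(probe.coef_[0])
    directions.append(w)

    # Null-space projection: remove the component along w so the next
    # probe must find a different deception-encoding direction.
    X = X - np.outer(X @ w, w)
```

Probe accuracy decays toward chance as the planted directions are projected out; counting how many iterations stay above chance is one way to estimate the number of deception-encoding directions, analogous to the 20-100 reported for real models.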
Source
- Link: https://arxiv.org/abs/2508.19505
- Listed in the Shallow Review of Technical AI Safety 2025 under two agendas:
  - extracting-latent-knowledge — White-box safety (i.e. interpretability)
  - lie-and-deception-detectors — White-box safety (i.e. interpretability)