Caught in the Act: a mechanistic approach to detecting deception

Gerard Boxo, Ryan Socha, Daniel Yoo, Shivam Raval — 2025-08-27 — arXiv

Summary

Demonstrates that linear probes trained on LLMs’ internal activations can detect deceptive responses with over 90% accuracy across multiple model families (Llama, Qwen, and DeepSeek variants spanning 1.5B to 14B parameters), and identifies multiple linear directions encoding deception via iterative nullspace projection.
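A linear probe here is just a logistic-regression classifier fit to a model's hidden activations. A minimal self-contained sketch, using synthetic Gaussian activations as a stand-in (the paper probes real residual-stream activations; the dimension, shift size, and training loop below are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for hidden-state activations: honest and deceptive
# examples differ by a shift along one planted unit direction.
d = 64                                  # hidden dimension (assumption)
n = 400                                 # examples per class
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)

honest = rng.normal(size=(n, d))
deceptive = rng.normal(size=(n, d)) + 3.0 * direction

X = np.vstack([honest, deceptive])
y = np.concatenate([np.zeros(n), np.ones(n)])

# Linear probe = logistic regression on activations, trained by
# full-batch gradient descent.
w, b = np.zeros(d), 0.0
lr = 0.5
for _ in range(200):
    z = np.clip(X @ w + b, -30, 30)     # clip logits to avoid exp overflow
    p = 1.0 / (1.0 + np.exp(-z))
    w -= lr * X.T @ (p - y) / len(y)
    b -= lr * np.mean(p - y)

acc = np.mean(((X @ w + b) > 0) == y)
print(f"probe accuracy: {acc:.2f}")
```

Because the probe is linear, its learned weight vector `w` directly gives a candidate "deception direction" in activation space, which is what makes the nullspace-projection analysis below possible.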

Key Result

Linear probes achieve >90% accuracy at detecting deception in larger reasoning models (>7B parameters); accuracy follows a three-stage pattern across layers, peaking in the middle layers, and 20-100 deception-encoding directions are identified, with the count depending on model size.
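The direction counts come from iterative nullspace projection: fit a probe, project the activations onto the nullspace of its weight vector, refit, and repeat until the probe falls to chance; the number of above-chance iterations estimates how many linear directions encode the concept. A sketch under the same synthetic-activation assumption as above (the stopping threshold and dimensions are illustrative, not the paper's settings):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 32, 500

# Synthetic stand-in for activations, with one planted deception direction.
u = np.zeros(d); u[0] = 1.0
y = np.concatenate([np.zeros(n), np.ones(n)])
X = rng.normal(size=(2 * n, d)) + np.outer(y, 3.0 * u)

def fit_linear_probe(X, y, steps=300, lr=0.5):
    """Logistic-regression probe; returns weight vector and train accuracy."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        z = np.clip(X @ w + b, -30, 30)
        p = 1.0 / (1.0 + np.exp(-z))
        w -= lr * X.T @ (p - y) / len(y)
        b -= lr * np.mean(p - y)
    acc = np.mean(((X @ w + b) > 0) == y)
    return w, acc

# Iterative nullspace projection: remove each probed direction and refit
# until the probe is near chance.
directions, accs = [], []
Xp = X.copy()
for _ in range(5):
    w, acc = fit_linear_probe(Xp, y)
    accs.append(acc)
    if acc < 0.65:                      # near chance: no linear signal left
        break
    v = w / np.linalg.norm(w)
    directions.append(v)
    Xp = Xp - np.outer(Xp @ v, v)       # project onto nullspace of v

print(f"directions found: {len(directions)}, probe accuracies: {accs}")
```

With one planted direction the probe collapses to chance after a single projection; in the paper's real models, repeating this loop keeps finding above-chance probes for 20-100 iterations.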

Source