Caught in the Act: a mechanistic approach to detecting deception
Gerard Boxo, Ryan Socha, Daniel Yoo, Shivam Raval — 2025-08-27 — arXiv
Summary
Demonstrates that linear probes trained on LLMs’ internal activations can detect deceptive responses with over 90% accuracy across multiple models (Llama, Qwen, and DeepSeek variants from 1.5B to 14B parameters), and identifies multiple linear directions encoding deception via iterative null-space projection.
Key Result
Linear probes achieve >90% accuracy at detecting deception in larger reasoning models (>7B parameters); probe accuracy follows a three-stage pattern across depth, peaking in middle layers, and between 20 and 100 deception-encoding directions are identified depending on model size.
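The iterative null-space projection step can be sketched as follows. This is a minimal illustration in the style of INLP, not the authors' code: a linear probe is fit on activations, its direction is recorded, and that direction is projected out of the data before refitting; repeating this recovers multiple label-encoding directions. Synthetic Gaussian "activations" with two planted deception directions stand in for real LLM hidden states, and all dimensions and constants below are arbitrary choices for the demo.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d, n = 32, 2000

# Synthetic stand-in for hidden states: plant two label-correlated
# directions (axes 0 and 1) in an otherwise isotropic activation space.
labels = rng.integers(0, 2, n)
X = rng.normal(size=(n, d)).astype(float)
X[:, 0] += 3.0 * labels   # strong "deception" direction
X[:, 1] += 1.5 * labels   # weaker secondary direction

directions, accs = [], []
for _ in range(3):
    # Fit a linear probe on the current (projected) activations.
    probe = LogisticRegression(max_iter=1000).fit(X, labels)
    accs.append(probe.score(X, labels))

    # Record the probe's unit direction.
    w = probe.coef_[0] / np.linalg.norm(probe.coef_[0])
    directions.append(w)

    # Null-space projection: remove the component along w so the next
    # probe must find a different deception-encoding direction.
    X = X - np.outer(X @ w, w)
```

Probe accuracy decays toward chance as the planted directions are projected out; counting how many iterations stay above chance is one way to estimate the number of deception-encoding directions, analogous to the 20-100 reported for real models.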
Source
- Link: https://arxiv.org/abs/2508.19505
- Listed in the Shallow Review of Technical AI Safety 2025 under two agendas:
  - extracting-latent-knowledge — White-box safety (i.e. interpretability)
  - lie-and-deception-detectors — White-box safety (i.e. interpretability)