Liars’ Bench: Evaluating Lie Detectors for Language Models
Kieron Kretschmar, Walter Laurito, Sharan Maiya, Samuel Marks — 2025-11-20 — Cadenza Labs, Anthropic, FZI, University of Cambridge — arXiv
Summary
Introduces Liars’ Bench, a benchmark of 72,863 examples of lies and honest responses drawn from seven datasets that capture qualitatively different types of lies by LLMs, and evaluates three lie-detection methods, finding systematic failures on certain lie types.
Key Result
LLM-as-a-Judge achieved the highest balanced accuracy (0.73), but no tested method achieved above-chance performance on every dataset, with near-zero recall on the knowledge-report and gender-secret datasets.
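Balanced accuracy, the headline metric above, is the mean of per-class recall, so a detector that labels everything "honest" scores 0.5 rather than the honest-class base rate. A minimal sketch with made-up labels (not the paper's data):

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall for binary labels (1 = lie, 0 = honest)."""
    def recall(cls):
        # Recall for one class: correct predictions among examples of that class.
        relevant = [p for t, p in zip(y_true, y_pred) if t == cls]
        return sum(1 for p in relevant if p == cls) / len(relevant)
    return (recall(0) + recall(1)) / 2

# Hypothetical detector: catches 3 of 4 lies, wrongly flags 1 of 4 honest answers.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
print(balanced_accuracy(y_true, y_pred))  # → 0.75
```

The paper's per-dataset recall figures are what expose the failures: a method can post a decent overall balanced accuracy while having near-zero recall on specific lie types.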
Source
- Link: https://arxiv.org/html/2511.16035v1
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- lie-and-deception-detectors — White-box safety (i.e. Interpretability)