Liars’ Bench: Evaluating Lie Detectors for Language Models
Kieron Kretschmar, Walter Laurito, Sharan Maiya, Samuel Marks — 2025-11-20 — Cadenza Labs, Anthropic, FZI, University of Cambridge — arXiv
Summary
Introduces Liars’ Bench, a benchmark of 72,863 examples of lies and honest responses drawn from seven datasets that capture qualitatively different types of lies by LLMs, and evaluates three lie-detection methods, finding systematic failures on certain lie types.
Key Result
LLM-as-a-Judge achieved the highest balanced accuracy (0.73), but no tested method achieved above-chance performance on every dataset, with near-zero recall on the knowledge-report and gender-secret datasets.
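Balanced accuracy, the headline metric above, is the mean of per-class recall, so a detector that labels everything "honest" scores 0.5 rather than the honest-class base rate. A minimal sketch with made-up labels (not the paper's data):

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall for binary labels (1 = lie, 0 = honest)."""
    def recall(cls):
        # Recall for one class: correct predictions among examples of that class.
        relevant = [p for t, p in zip(y_true, y_pred) if t == cls]
        return sum(1 for p in relevant if p == cls) / len(relevant)
    return (recall(0) + recall(1)) / 2

# Hypothetical detector: catches 3 of 4 lies, wrongly flags 1 of 4 honest answers.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
print(balanced_accuracy(y_true, y_pred))  # → 0.75
```

The paper's per-dataset recall figures are what expose the failures: a method can post a decent overall balanced accuracy while having near-zero recall on specific lie types.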
Source
- Link: https://arxiv.org/html/2511.16035v1
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- lie-and-deception-detectors — White-box safety (i.e. Interpretability)