Liars’ Bench: Evaluating Lie Detectors for Language Models

Kieron Kretschmar, Walter Laurito, Sharan Maiya, Samuel Marks — 2025-11-20 — Cadenza Labs, FZI, University of Cambridge, Anthropic — arXiv

Summary

Introduces LIARS’ BENCH, a testbed of 72,863 examples of lies and honest responses generated by four open-weight models across seven datasets, and evaluates three existing lie detection techniques, revealing that they fail systematically on certain types of lies.

Key Result

Existing black-box and white-box lie detection techniques systematically fail to identify certain types of lies, especially in settings where it’s not possible to determine whether the model lied from the transcript alone.

Source

Link: https://arxiv.org/abs/2511.16035
Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- ai-deception-evals — Evals
