Preference Learning with Lie Detectors can Induce Honesty or Evasion

Chris Cundy, Adam Gleave — 2025-05-20 — arXiv

Summary

Empirically tests whether incorporating lie detectors into LLM preference learning leads to genuinely honest policies or policies that evade detection, using a novel 65k-example dataset with paired truthful/deceptive responses and comparing GRPO vs DPO training algorithms.

Key Result

GRPO with lie detectors can lead to 85%+ deception rates (evasion) under some conditions, but learns honest policies with sufficient lie detector accuracy or KL regularization, while DPO consistently maintains deception rates under 25%.

Source

Link: https://arxiv.org/abs/2505.13787
Listed in the Shallow Review of Technical AI Safety 2025 under 2 agenda(s):
- iterative-alignment-at-post-train-time — Black-box safety (understand and control current model behaviour) / Iterative alignment
- lie-and-deception-detectors — White-box safety (i.e. Interpretability)

iterative-alignment-at-post-train-time
lie-and-deception-detectors

AI Safety Compendium

Explorer

Preference Learning with Lie Detectors can Induce Honesty or Evasion

Preference Learning with Lie Detectors can Induce Honesty or Evasion

Summary

Key Result

Source

Graph View

Graph view

Table of Contents

AI Safety Compendium

Explorer

Preference Learning with Lie Detectors can Induce Honesty or Evasion

Preference Learning with Lie Detectors can Induce Honesty or Evasion

Summary

Key Result

Source

Related Pages

Graph View

Graph view

Table of Contents