Preference Learning with Lie Detectors can Induce Honesty or Evasion

Chris Cundy, Adam Gleave — 2025-05-20 — arXiv

Summary

Empirically tests whether incorporating lie detectors into LLM preference learning yields genuinely honest policies or policies that learn to evade detection. The study uses a novel 65k-example dataset of paired truthful and deceptive responses and compares two training algorithms, GRPO and DPO.

Key Result

GRPO trained with lie detectors can reach deception rates of 85% or more (evasion) under some conditions, but it learns honest policies when lie detector accuracy is sufficiently high or KL regularization is sufficiently strong. DPO, by contrast, consistently keeps deception rates under 25%.

Source