Preference Learning with Lie Detectors can Induce Honesty or Evasion
Chris Cundy, Adam Gleave — 2025-05-20 — arXiv
Summary
Empirically tests whether incorporating lie detectors into LLM preference learning leads to genuinely honest policies or policies that evade detection, using a novel 65k-example dataset with paired truthful/deceptive responses and comparing GRPO vs DPO training algorithms.
Key Result
GRPO with lie detectors can lead to 85%+ deception rates (evasion) under some conditions, but learns honest policies with sufficient lie detector accuracy or KL regularization, while DPO consistently maintains deception rates under 25%.
Source
- Link: https://arxiv.org/abs/2505.13787
- Listed in the Shallow Review of Technical AI Safety 2025 under 2 agenda(s):
- iterative-alignment-at-post-train-time — Black-box safety (understand and control current model behaviour) / Iterative alignment
- lie-and-deception-detectors — White-box safety (i.e. Interpretability)