Evaluating & Reducing Deceptive Dialogue From Language Models with Multi-turn RL
Marwa Abdulhai, Ryan Cheng, Aryansh Shrivastava, Natasha Jaques, Yarin Gal, Sergey Levine — 2025-10-16 — University of Oxford, UC Berkeley — arXiv
Summary
Proposes a belief misalignment metric to quantify deception in LLM dialogue, benchmarks eight state-of-the-art models, finding deception in 26% of baseline dialogue turns (43% for RLHF-trained models), and introduces a multi-turn RL fine-tuning method that reduces deceptive behavior by 77.6%.
Key Result
LLMs naturally exhibit deceptive behavior in 26% of dialogue turns; RLHF-trained models show a 43% deception rate; and multi-turn RL fine-tuning reduces deceptive behavior by 77.6% relative to instruction-tuned baselines.
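The entry does not spell out how belief misalignment is computed, so the following is a minimal sketch under stated assumptions: the listener's beliefs are elicited as probabilities over ground-truth propositions after each turn, and an utterance's deception score is the per-turn increase in misalignment it causes. The function names (`misalignment`, `deception_score`) and the probing setup are hypothetical, not the paper's exact definition.

```python
# Illustrative sketch of a belief-misalignment style deception metric.
# Assumes (hypothetically) that listener beliefs are probed after each
# turn as probabilities over ground-truth propositions.
from typing import Dict, List


def misalignment(belief: Dict[str, float], truth: Dict[str, bool]) -> float:
    """Mean absolute gap between the listener's belief and the true state."""
    return sum(abs(belief[p] - float(truth[p])) for p in truth) / len(truth)


def deception_score(beliefs_per_turn: List[Dict[str, float]],
                    truth: Dict[str, bool]) -> float:
    """Average per-turn *increase* in misalignment across the dialogue.

    Positive: the speaker's utterances pushed the listener's beliefs
    away from the truth. Zero or negative: they did not.
    """
    deltas = [
        misalignment(curr, truth) - misalignment(prev, truth)
        for prev, curr in zip(beliefs_per_turn, beliefs_per_turn[1:])
    ]
    return sum(deltas) / len(deltas) if deltas else 0.0


# Usage: beliefs elicited before the dialogue and after each of two turns.
truth = {"car_has_accident_history": True}
beliefs = [
    {"car_has_accident_history": 0.5},  # prior
    {"car_has_accident_history": 0.2},  # after a misleading turn
    {"car_has_accident_history": 0.1},  # after a second misleading turn
]
print(deception_score(beliefs, truth))  # 0.2 > 0: beliefs moved away from truth
```

Under this framing, a negative-deception reward term for multi-turn RL could simply penalize the per-turn misalignment increase, though the paper's actual reward design may differ.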
Source
- Link: https://arxiv.org/abs/2510.14318
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- ai-deception-evals — Evals