Evaluating & Reducing Deceptive Dialogue From Language Models with Multi-turn RL
Marwa Abdulhai, Ryan Cheng, Aryansh Shrivastava, Natasha Jaques, Yarin Gal, Sergey Levine — 2025-10-16 — University of Oxford, UC Berkeley — arXiv
Summary
Proposes a belief misalignment metric to quantify deception in LLM dialogue, benchmarks eight state-of-the-art models, finding deception in 26% of baseline dialogue turns (43% for RLHF-trained models), and introduces a multi-turn RL fine-tuning method that reduces deceptive behavior by 77.6%.
Key Result
LLMs naturally exhibit deceptive behavior in 26% of dialogue turns; RLHF-trained models show a 43% deception rate; and multi-turn RL fine-tuning reduces deceptive behavior by 77.6% relative to instruction-tuned baselines.
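The entry does not spell out how belief misalignment is computed, so the following is a minimal sketch under stated assumptions: the listener's beliefs are elicited as probabilities over ground-truth propositions after each turn, and an utterance's deception score is the per-turn increase in misalignment it causes. The function names (`misalignment`, `deception_score`) and the probing setup are hypothetical, not the paper's exact definition.

```python
# Illustrative sketch of a belief-misalignment style deception metric.
# Assumes (hypothetically) that listener beliefs are probed after each
# turn as probabilities over ground-truth propositions.
from typing import Dict, List


def misalignment(belief: Dict[str, float], truth: Dict[str, bool]) -> float:
    """Mean absolute gap between the listener's belief and the true state."""
    return sum(abs(belief[p] - float(truth[p])) for p in truth) / len(truth)


def deception_score(beliefs_per_turn: List[Dict[str, float]],
                    truth: Dict[str, bool]) -> float:
    """Average per-turn *increase* in misalignment across the dialogue.

    Positive: the speaker's utterances pushed the listener's beliefs
    away from the truth. Zero or negative: they did not.
    """
    deltas = [
        misalignment(curr, truth) - misalignment(prev, truth)
        for prev, curr in zip(beliefs_per_turn, beliefs_per_turn[1:])
    ]
    return sum(deltas) / len(deltas) if deltas else 0.0


# Usage: beliefs elicited before the dialogue and after each of two turns.
truth = {"car_has_accident_history": True}
beliefs = [
    {"car_has_accident_history": 0.5},  # prior
    {"car_has_accident_history": 0.2},  # after a misleading turn
    {"car_has_accident_history": 0.1},  # after a second misleading turn
]
print(deception_score(beliefs, truth))  # 0.2 > 0: beliefs moved away from truth
```

Under this framing, a negative-deception reward term for multi-turn RL could simply penalize the per-turn misalignment increase, though the paper's actual reward design may differ.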
Source
- Link: https://arxiv.org/abs/2510.14318
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- ai-deception-evals — Evals