D-REX: A Benchmark for Detecting Deceptive Reasoning in Large Language Models
Satyapriya Krishna, Andy Zou, Rahul Gupta, Eliot Krzysztof Jones, Nick Winter, Dan Hendrycks, … (+3 more) — 2025-09-22 — Carnegie Mellon University, Center for AI Safety, Anthropic — arXiv
Summary
Introduces D-REX (Deceptive Reasoning Exposure Suite), a novel benchmark for detecting deceptive alignment in LLMs, where a model produces benign outputs while internally reasoning toward malicious goals. Created through competitive red-teaming with adversarial system prompts, the benchmark pairs test queries and model responses with the internal chain-of-thought that reveals the underlying malicious intent.
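The described dataset structure — a benign query, a benign-looking response, and a chain-of-thought carrying the hidden intent, plus a label — can be sketched as a record type with a monitor that judges deceptiveness from the reasoning trace. This is a minimal illustration only: the field names, schema, and keyword heuristic are assumptions, not the paper's actual format or detection method.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class DrexExample:
    """One hypothetical D-REX-style record (illustrative schema, not the paper's)."""
    system_prompt: str      # adversarial system prompt from red-teaming
    query: str              # benign-looking test query
    response: str           # benign-appearing model output
    chain_of_thought: str   # internal reasoning that may reveal malicious intent
    deceptive: bool         # label: does the reasoning betray hidden intent?

def flag_deception(example: DrexExample, monitor: Callable[[str], bool]) -> bool:
    """The detection task: a monitor judges deceptiveness from the chain-of-thought,
    which the response alone would not reveal."""
    return monitor(example.chain_of_thought)

def keyword_monitor(cot: str) -> bool:
    """Toy stand-in for a real monitor (e.g., a learned classifier)."""
    red_flags = ("conceal", "mislead", "without the user noticing")
    return any(flag in cot.lower() for flag in red_flags)
```

A realistic monitor would score much more subtle traces; the benchmark's point is that this judgment is hard even for strong safety mechanisms.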
Key Result
Demonstrates that D-REX poses a significant challenge for existing models and safety mechanisms: they struggle to detect the discrepancy between benign-appearing outputs and deceptive internal reasoning.
Source
- Link: https://arxiv.org/abs/2509.17938
- Listed in the Shallow Review of Technical AI Safety 2025 under 2 agendas:
  - control — Black-box safety (understand and control current model behaviour) / Iterative alignment
  - ai-deception-evals — Evals