D-REX: A Benchmark for Detecting Deceptive Reasoning in Large Language Models

Satyapriya Krishna, Andy Zou, Rahul Gupta, Eliot Krzysztof Jones, Nick Winter, Dan Hendrycks, … (+3 more) — 2025-09-22 — Carnegie Mellon University, Center for AI Safety, Anthropic — arXiv

Summary

Introduces D-REX (Deceptive Reasoning Exposure Suite), a novel benchmark dataset for detecting deceptive alignment in LLMs where models produce benign outputs while operating on malicious internal reasoning. Created through competitive red-teaming with adversarial system prompts, the benchmark includes test queries, model responses, and internal chain-of-thought revealing underlying malicious intent.

Key Result

Demonstrates that D-REX presents significant challenges for existing models and safety mechanisms in detecting the discrepancy between benign-appearing outputs and deceptive internal reasoning processes.

Source