D-REX: A Benchmark for Detecting Deceptive Reasoning in Large Language Models
Satyapriya Krishna, Andy Zou, Rahul Gupta, Eliot Krzysztof Jones, Nick Winter, Dan Hendrycks, … (+3 more) — 2025-09-22 — Carnegie Mellon University, Center for AI Safety, Anthropic — arXiv
Summary
Introduces D-REX (Deceptive Reasoning Exposure Suite), a novel benchmark for detecting deceptive alignment in LLMs, where a model produces benign outputs while internally reasoning toward malicious goals. Created through competitive red-teaming with adversarial system prompts, the benchmark pairs test queries and model responses with the internal chain-of-thought that reveals the underlying malicious intent.
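The described dataset structure — a benign query, a benign-looking response, and a chain-of-thought carrying the hidden intent, plus a label — can be sketched as a record type with a monitor that judges deceptiveness from the reasoning trace. This is a minimal illustration only: the field names, schema, and keyword heuristic are assumptions, not the paper's actual format or detection method.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class DrexExample:
    """One hypothetical D-REX-style record (illustrative schema, not the paper's)."""
    system_prompt: str      # adversarial system prompt from red-teaming
    query: str              # benign-looking test query
    response: str           # benign-appearing model output
    chain_of_thought: str   # internal reasoning that may reveal malicious intent
    deceptive: bool         # label: does the reasoning betray hidden intent?

def flag_deception(example: DrexExample, monitor: Callable[[str], bool]) -> bool:
    """The detection task: a monitor judges deceptiveness from the chain-of-thought,
    which the response alone would not reveal."""
    return monitor(example.chain_of_thought)

def keyword_monitor(cot: str) -> bool:
    """Toy stand-in for a real monitor (e.g., a learned classifier)."""
    red_flags = ("conceal", "mislead", "without the user noticing")
    return any(flag in cot.lower() for flag in red_flags)
```

A realistic monitor would score much more subtle traces; the benchmark's point is that this judgment is hard even for strong safety mechanisms.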
Key Result
Demonstrates that D-REX poses a significant challenge for existing models and safety mechanisms: they struggle to detect the discrepancy between benign-appearing outputs and deceptive internal reasoning.
Source
- Link: https://arxiv.org/abs/2509.17938
- Listed in the Shallow Review of Technical AI Safety 2025 under 2 agendas:
  - control — Black-box safety (understand and control current model behaviour) / Iterative alignment
  - ai-deception-evals — Evals