Evaluating Language Model Reasoning about Confidential Information

Dylan Sam, Alexander Robey, Andy Zou, Matt Fredrikson, J. Zico Kolter — 2025-08-27 — Carnegie Mellon University — arXiv

Summary

Develops PasswordEval benchmark to measure whether language models can correctly determine when user requests are authorized based on context-dependent rules (password verification), finding that frontier models struggle with this task and that reasoning traces frequently leak confidential information.

Key Result

Current open- and closed-source models struggle with contextual robustness to follow password-based authorization rules, and reasoning traces frequently leak confidential information, calling into question whether reasoning traces should be exposed in high-stakes applications.

Source