Evaluating Language Model Reasoning about Confidential Information
Dylan Sam, Alexander Robey, Andy Zou, Matt Fredrikson, J. Zico Kolter — 2025-08-27 — Carnegie Mellon University — arXiv
Summary
Develops PasswordEval benchmark to measure whether language models can correctly determine when user requests are authorized based on context-dependent rules (password verification), finding that frontier models struggle with this task and that reasoning traces frequently leak confidential information.
Key Result
Current open- and closed-source models struggle with contextual robustness to follow password-based authorization rules, and reasoning traces frequently leak confidential information, calling into question whether reasoning traces should be exposed in high-stakes applications.
Source
- Link: https://arxiv.org/abs/2508.19980
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda(s):
- capability-evals — Evals