Evaluating Language Model Reasoning about Confidential Information

Dylan Sam, Alexander Robey, Andy Zou, Matt Fredrikson, J. Zico Kolter — 2025-08-27 — Carnegie Mellon University — arXiv

Summary

Develops PasswordEval benchmark to measure whether language models can correctly determine when user requests are authorized based on context-dependent rules (password verification), finding that frontier models struggle with this task and that reasoning traces frequently leak confidential information.

Key Result

Current open- and closed-source models struggle with contextual robustness to follow password-based authorization rules, and reasoning traces frequently leak confidential information, calling into question whether reasoning traces should be exposed in high-stakes applications.

Source

Link: https://arxiv.org/abs/2508.19980
Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda(s):
- capability-evals — Evals

capability-evals

AI Safety Compendium

Explorer

Evaluating Language Model Reasoning about Confidential Information

Evaluating Language Model Reasoning about Confidential Information

Summary

Key Result

Source

Graph View

Graph view

Table of Contents

AI Safety Compendium

Explorer

Evaluating Language Model Reasoning about Confidential Information

Evaluating Language Model Reasoning about Confidential Information

Summary

Key Result

Source

Related Pages

Graph View

Graph view

Table of Contents