Reasoning Models Don’t Always Say What They Think
Yanda Chen, Joe Benton, Ansh Radhakrishnan, Jonathan Uesato, Carson Denison, John Schulman, … (+9 more) — 2025-05-08 — OpenAI, Anthropic — arXiv
Summary
Evaluates chain-of-thought (CoT) faithfulness in state-of-the-art reasoning models by testing whether they verbalize reasoning hints they actually use, finding that CoTs reveal hint usage in at least 1% of cases where a hint is used, but often in fewer than 20%.
Key Result
CoT monitoring shows promise for noticing undesired behaviors during training but is insufficient to rule them out, since reveal rates often fall below 20% even when models demonstrably use the reasoning hints.
Source
- Link: https://arxiv.org/abs/2505.05410
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- chain-of-thought-monitoring — Black-box safety (understand and control current model behaviour) / Iterative alignment