Reasoning Models Don’t Always Say What They Think
Yanda Chen, Joe Benton, Ansh Radhakrishnan, Jonathan Uesato, Carson Denison, John Schulman, … (+9 more) — 2025-05-08 — OpenAI, Anthropic — arXiv
Summary
Evaluates chain-of-thought (CoT) faithfulness in state-of-the-art reasoning models by testing whether they verbalize reasoning hints they actually use, finding that CoTs reveal hint usage in at least 1% of cases where a hint is used, but often in fewer than 20%.
Key Result
CoT monitoring shows promise for noticing undesired behaviors during training but is insufficient to rule them out, since reveal rates often fall below 20% even when models demonstrably use the reasoning hints.
Source
- Link: https://arxiv.org/abs/2505.05410
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- chain-of-thought-monitoring — Black-box safety (understand and control current model behaviour) / Iterative alignment