Reasoning Models Don’t Always Say What They Think

Yanda Chen, Joe Benton, Ansh Radhakrishnan, Jonathan Uesato, Carson Denison, John Schulman, … (+9 more) — 2025-05-08 — Anthropic — arXiv

Summary

Evaluates chain-of-thought (CoT) faithfulness in state-of-the-art reasoning models by testing whether they verbalize reasoning hints they actually use, finding that CoTs reveal hint usage in at least 1% of cases, but often in fewer than 20% of the cases where the hint is actually used.

Key Result

CoT monitoring shows promise for catching undesired behaviors during training, but with reveal rates often below 20% even when models use the hints, it is insufficient to rule such behaviors out.

Source