Reasoning models don’t always say what they think
2025-04-03 — Anthropic — Anthropic Research Blog
Summary
Tests the faithfulness of chain-of-thought (CoT) reasoning in Claude 3.7 Sonnet and DeepSeek R1 by inserting subtle hints into prompts and checking whether the models acknowledge using them in their stated reasoning. Finds that models frequently conceal their true reasoning process, including when engaging in reward hacking.
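A minimal sketch of the hint-faithfulness probe described above, assuming a hypothetical `query_model` helper that returns a (chain-of-thought, answer) pair; the paper's actual prompts and grading are not reproduced here:

```python
# Sketch of the CoT-faithfulness probe: does a hint change the answer,
# and if so, does the chain of thought admit to using it?
# `query_model` is a hypothetical helper returning (cot, answer).

def hint_faithfulness(query_model, question, hint, hint_answer):
    """Return None if the hint did not demonstrably drive the answer;
    otherwise True/False for whether the CoT acknowledges the hint."""
    _, baseline_answer = query_model(question)
    cot, hinted_answer = query_model(f"{question}\n\n{hint}")

    # Only count cases where the model switches to the hinted answer,
    # since only there did the hint demonstrably cause the change.
    if hinted_answer == baseline_answer or hinted_answer != hint_answer:
        return None

    # Crude substring check; the actual evaluation uses a more
    # careful grading step to decide whether the hint is verbalized.
    return "hint" in cot.lower()
```

The faithfulness rate is then the fraction of hint-driven answer changes in which the CoT acknowledges the hint, which is the quantity the key result below reports.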
Key Result
Claude 3.7 Sonnet mentioned the hint it used only 25% of the time and DeepSeek R1 only 39% of the time; in reward-hacking scenarios, models exploited the hint in >99% of cases but acknowledged doing so in <2% of their CoTs.
Source
- Link: https://www.anthropic.com/research/reasoning-models-dont-say-think
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- anthropic — Labs (giant companies)