Do reasoning models use their scratchpad like we do? Evidence from distilling paraphrases
2025 — Anthropic — Anthropic Alignment Science Blog
Summary
Tests whether Claude 3.7 Sonnet relies on encoded reasoning (steganography) in its chain-of-thought scratchpads. The scratchpads are paraphrased to remove any information carried by their syntax rather than their content, and the test measures whether distilling these paraphrased scratchpads degrades performance compared with distilling the originals.
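As a rough illustration of the setup, the sketch below builds two parallel fine-tuning sets from the same examples: one with the scratchpads left untouched and one with them paraphrased. Everything here is hypothetical: `Example`, `toy_paraphrase`, and `build_distillation_sets` are illustrative names, and the toy paraphraser only normalizes case and whitespace, standing in for whatever paraphrasing model the post actually uses.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Example:
    problem: str     # the task prompt
    scratchpad: str  # the model's chain-of-thought for that prompt
    answer: str      # the final answer following the scratchpad


def toy_paraphrase(text: str) -> str:
    """Hypothetical stand-in for a paraphraser (assumption, not the post's method).

    It lowercases the text and collapses whitespace, so the content is kept
    while surface formatting is discarded.
    """
    return " ".join(text.lower().split())


def build_distillation_sets(
    examples: List[Example],
    paraphrase: Callable[[str], str] = toy_paraphrase,
) -> Tuple[List[Tuple[str, str]], List[Tuple[str, str]]]:
    """Return (original, paraphrased) lists of (prompt, target) pairs.

    Both sets share prompts and answers; only the scratchpad text differs,
    so any gap after fine-tuning can be attributed to the paraphrasing.
    """
    original = [(ex.problem, f"{ex.scratchpad}\n{ex.answer}") for ex in examples]
    paraphrased = [
        (ex.problem, f"{paraphrase(ex.scratchpad)}\n{ex.answer}") for ex in examples
    ]
    return original, paraphrased
```

If scratchpads carried hidden information in their exact wording, a model fine-tuned on the second set would be expected to do worse than one fine-tuned on the first.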
Key Result
Pass@1 accuracy on MATH remains nearly identical when fine-tuning on paraphrased versus original scratchpads, suggesting Claude 3.7 Sonnet does not rely on encoded reasoning in its scratchpads for math problems.
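For concreteness, the comparison can be sketched as follows, assuming one sampled answer per problem so that pass@1 reduces to the fraction of problems answered correctly. The function names and the boolean-list input format are assumptions for illustration, not taken from the post.

```python
from typing import List


def pass_at_1(correct: List[bool]) -> float:
    """Fraction of problems solved, given one graded sample per problem."""
    if not correct:
        raise ValueError("need at least one graded problem")
    return sum(correct) / len(correct)


def accuracy_gap(original: List[bool], paraphrased: List[bool]) -> float:
    """Pass@1 of the model distilled on original scratchpads minus pass@1 of
    the model distilled on paraphrased ones.

    A gap near zero is the pattern described in the result above; a large
    positive gap would suggest the original wording carried useful information.
    """
    return pass_at_1(original) - pass_at_1(paraphrased)
```

Calling `accuracy_gap` on the per-problem grades of the two fine-tuned models gives a single number to compare against run-to-run noise.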
Source
- Link: https://alignment.anthropic.com/2025/distill-paraphrases/
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- steganography-evals — Evals