Do reasoning models use their scratchpad like we do? Evidence from distilling paraphrases

2025 — Anthropic — Anthropic Alignment Science Blog

Summary

Tests whether Claude 3.7 Sonnet uses encoded reasoning (steganography) in its chain-of-thought scratchpads. The scratchpads are paraphrased to remove syntax-based information, and the paraphrased scratchpads are then distilled to measure whether performance degrades.

Key Result

Pass@1 accuracy on MATH remains nearly identical when fine-tuning on paraphrased versus original scratchpads, suggesting Claude 3.7 Sonnet does not rely on encoded reasoning in its scratchpads for math problems.
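
The comparison behind this result can be illustrated with a minimal sketch. It assumes each condition has one graded attempt per problem. The function names and the grade lists are hypothetical, made up for illustration, and are not data or code from the post.

```python
from typing import Sequence


def pass_at_1(correct: Sequence[bool]) -> float:
    """Fraction of problems solved with a single sampled attempt per problem."""
    if not correct:
        raise ValueError("need at least one graded attempt")
    return sum(correct) / len(correct)


def accuracy_gap(original: Sequence[bool], paraphrased: Sequence[bool]) -> float:
    """pass@1 of the original-scratchpad model minus the paraphrased-scratchpad model."""
    return pass_at_1(original) - pass_at_1(paraphrased)


# Toy, made-up grades (hypothetical; NOT results from the post).
original_grades = [True, True, False, True]
paraphrased_grades = [True, False, True, True]

gap = accuracy_gap(original_grades, paraphrased_grades)
print(f"pass@1 gap (original - paraphrased): {gap:.3f}")
```

Under this reading, a gap near zero is what one would expect if the original scratchpads carry no extra information in their exact wording, while a large positive gap would point toward encoded reasoning.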

Source

Link: https://alignment.anthropic.com/2025/distill-paraphrases/
Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- steganography-evals — Evals
