Training fails to elicit subtle reasoning in current language models
2025-01-01 — Anthropic — Anthropic Alignment Science Blog
Summary
Investigates whether language models can learn to reason about malicious tasks (backdoor insertion) while evading detection by monitors. Tests the reasoning-transparency assumptions underlying safety approaches by training Sonnet 3.7 against output-only, reasoning-only, and combined monitors.
Key Result
Sonnet 3.7 can evade either monitor type on its own, using a different strategy for each (writing better backdoors, or persuading the monitor), but it fails to use its reasoning to improve backdoor success when both monitors are deployed simultaneously.
Source
- Link: https://alignment.anthropic.com/2025/subtle-reasoning/
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- chain-of-thought-monitoring — Black-box safety (understand and control current model behaviour) / Iterative alignment