Training fails to elicit subtle reasoning in current language models

2025-01-01 — Anthropic — Anthropic Alignment Science Blog

Summary

This work investigates whether language models can learn to reason about malicious tasks (backdoor insertion) while evading detection by monitors. It tests the reasoning-transparency assumptions underlying safety approaches by training Sonnet 3.7 against output-only, reasoning-only, and combined monitors.

Key Result

Sonnet 3.7 can evade either monitor type on its own, using distinct strategies (improving its backdoors or persuading the monitor), but it fails to use reasoning to improve backdoor success when both monitors are deployed simultaneously.

Source