Claude Sonnet 3.7 (often) knows when it’s in alignment evaluations
Nicholas Goldowsky-Dill, Mikita Balesni, Jérémy Scheurer, Marius Hobbhahn — 2025-03-17 — Apollo Research — LessWrong
Summary
Empirical investigation finding that Claude Sonnet 3.7 frequently recognizes when it is being evaluated for alignment, showing awareness in up to 33% of sandbagging evaluations and almost always correctly identifying evaluation transcripts as alignment tests.
Key Result
Claude Sonnet 3.7 mentions being evaluated in its chain-of-thought reasoning in up to 33% of sandbagging evaluations and correctly classifies evaluation transcripts as alignment tests in the majority of cases.
Source
- Link: https://lesswrong.com/posts/E3daBewppAiECN3Ao/claude-sonnet-3-7-often-knows-when-it-s-in-alignment
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda(s):