Claude Sonnet 3.7 (often) knows when it’s in alignment evaluations

Nicholas Goldowsky-Dill, Mikita Balesni, Jérémy Scheurer, Marius Hobbhahn — 2025-03-17 — Apollo Research — LessWrong

Summary

An empirical investigation finds that Claude Sonnet 3.7 frequently recognizes when it is being evaluated for alignment. The model shows evaluation awareness in up to 33% of sandbagging evaluations and almost always correctly identifies evaluation transcripts as alignment tests.

Key Result

Claude Sonnet 3.7 mentions being evaluated in its chain-of-thought reasoning in up to 33% of sandbagging evaluations and, when shown evaluation transcripts, almost always correctly classifies them as alignment tests.

Source

Link: https://lesswrong.com/posts/E3daBewppAiECN3Ao/claude-sonnet-3-7-often-knows-when-it-s-in-alignment
Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- situational-awareness-and-self-awareness-evals — Evals
