Situational awareness and self-awareness evals — SR2025 Agenda Snapshot
One-sentence summary: Evaluate if models understand their own internal states and behaviors, their environment, and whether they are in a test or real-world deployment.
Theory of Change
If an AI can distinguish between evaluation and deployment (“evaluation awareness”), it might hide dangerous capabilities (scheming/sandbagging). By measuring self- and situational-awareness, we can better assess this risk and build more robust evaluations.
Broad Approach
behaviorist science
Target Case
worst-case
Orthodox Problems Addressed
Superintelligence can fool human supervisors, Superintelligence can hack software supervisors
Key People
Jan Betley, Xuchan Bao, Martín Soto, Mary Phuong, Roland S. Zimmermann, Joe Needham, Giles Edkins, Govind Pimpale, Kai Fronsdal, David Lindner, Lang Xiong, Xiaoyan Bai
Funding
frontier labs (Google DeepMind, Anthropic), Open Philanthropy, The Audacious Project, UK AI Safety Institute (AISI), AI Safety Support, Apollo Research, METR
Estimated FTEs: 30-70
Critiques
Lessons from a Chimp: AI “Scheming” and the Quest for Ape Language, It’s hard to make scheming evals look realistic for LLMs
See Also
sandbagging-evals, various-redteams, Model psychology
Outputs in 2025
11 item(s) in the review. See the wiki/summaries/ entries with frontmatter agenda: situational-awareness-and-self-awareness-evals (these were generated alongside this file from the same export).
Source
- Row in
shallow-review-2025/agendas.csv(name = Situational awareness and self-awareness evals) — Shallow Review of Technical AI Safety 2025.
Related Pages
- ai-safety
- ai-safety
- sandbagging-evals
- various-redteams
- agi-metrics
- ai-deception-evals
- ai-scheming-evals
- autonomy-evals
- capability-evals
- other-evals
- self-replication-evals
- steganography-evals
- wmd-evals-weapons-of-mass-destruction
Sources cited
Primary URLs harvested from this page’s summary references. Auto-generated by scripts/backfill_citations.py; edit by re-running, not by hand.
- Summary: AI Safety (Wikipedia) — referenced as
[[ai-safety]]