AI scheming evals — SR2025 Agenda Snapshot
One-sentence summary: Evaluate frontier models for scheming, a sophisticated, strategic form of AI deception where a model covertly pursues a misaligned, long-term objective while deliberately faking alignment and compliance to evade detection by human supervisors and safety mechanisms.
Theory of Change
Robust evaluations must move beyond checking final outputs and probe the model’s reasoning to verify that alignment is genuine, not faked, because capable models are capable of strategically concealing misaligned goals (scheming) to pass standard behavioural evaluations.
Broad Approach
behavioral / engineering
Target Case
pessimistic
Orthodox Problems Addressed
Superintelligence can fool human supervisors
Key People
Bronson Schoen, Alexander Meinke, Jason Wolfe, Mary Phuong, Rohin Shah, Evgenia Nitishinskaya, Mikita Balesni, Marius Hobbhahn, Jérémy Scheurer, Wojciech Zaremba, David Lindner
Funding
OpenAI, Anthropic, Google DeepMind, Open Philanthropy
Estimated FTEs: 30-60
Critiques
See Also
ai-deception-evals, situational-awareness-and-self-awareness-evals
Outputs in 2025
7 item(s) in the review. See the wiki/summaries/ entries with frontmatter agenda: ai-scheming-evals (these were generated alongside this file from the same export).
Source
- Row in
shallow-review-2025/agendas.csv(name = AI scheming evals) — Shallow Review of Technical AI Safety 2025.
Related Pages
- ai-safety
- ai-safety
- ai-deception-evals
- situational-awareness-and-self-awareness-evals
- agi-metrics
- autonomy-evals
- capability-evals
- other-evals
- sandbagging-evals
- self-replication-evals
- steganography-evals
- various-redteams
- wmd-evals-weapons-of-mass-destruction
Sources cited
Primary URLs harvested from this page’s summary references. Auto-generated by scripts/backfill_citations.py; edit by re-running, not by hand.
- Summary: AI Safety (Wikipedia) — referenced as
[[ai-safety]]