Other evals — SR2025 Agenda Snapshot
One-sentence summary: A collection of miscellaneous evaluations for specific alignment properties, such as honesty, shutdown resistance and sycophancy.
Theory of Change
By developing novel benchmarks for specific, hard-to-measure properties (like honesty), critiquing the reliability of existing methods (like cultural surveys), and improving the formal rigor of evaluation systems (like LLM-as-Judges), researchers can create a more robust and comprehensive suite of evaluations to catch failures missed by standard capability or safety testing.
Broad Approach
behaviorist science
Target Case
average
Key People
Richard Ren, Mantas Mazeika, Andrés Corrada-Emmanuel, Ariba Khan, Stephen Casper
Funding
Lab funders (OpenAI), Open Philanthropy (which funds CAIS, the organization for the MASK benchmark), academic institutions. N/A (as a discrete amount). This work is part of the “tens of millions” budgets for broader evaluation and red-teaming efforts at labs and independent organizations.
Estimated FTEs: 20-50
Critiques
The Unreliability of Evaluating Cultural Alignment in LLMs, The Leaderboard Illusion
See Also
other more specific sections on evals
Outputs in 2025
20 item(s) in the review. See the wiki/summaries/ entries with frontmatter agenda: other-evals (these were generated alongside this file from the same export).
Source
- Row in
shallow-review-2025/agendas.csv(name = Other evals) — Shallow Review of Technical AI Safety 2025.
Related Pages
- ai-safety
- ai-safety
- agi-metrics
- ai-deception-evals
- ai-scheming-evals
- autonomy-evals
- capability-evals
- sandbagging-evals
- self-replication-evals
- situational-awareness-and-self-awareness-evals
- steganography-evals
- various-redteams
- wmd-evals-weapons-of-mass-destruction
Sources cited
Primary URLs harvested from this page’s summary references. Auto-generated by scripts/backfill_citations.py; edit by re-running, not by hand.
- Summary: AI Safety (Wikipedia) — referenced as
[[ai-safety]]