Other evals — SR2025 Agenda Snapshot

One-sentence summary: A collection of miscellaneous evaluations targeting specific alignment properties, such as honesty, shutdown resistance, and sycophancy.

Theory of Change

By developing novel benchmarks for specific, hard-to-measure properties (like honesty), critiquing the reliability of existing methods (like cultural surveys), and improving the formal rigor of evaluation systems (like LLM-as-Judges), researchers can create a more robust and comprehensive suite of evaluations to catch failures missed by standard capability or safety testing.

Broad Approach

behaviorist science

Target Case

average

Key People

Richard Ren, Mantas Mazeika, Andrés Corrada-Emmanuel, Ariba Khan, Stephen Casper

Funding

Lab funders (OpenAI), Open Philanthropy (which funds CAIS, the organization behind the MASK benchmark), and academic institutions. No discrete amount is attributable to this agenda; the work falls within the “tens of millions” budgets for broader evaluation and red-teaming efforts at labs and independent organizations.

Estimated FTEs: 20-50

Critiques

The Unreliability of Evaluating Cultural Alignment in LLMs, The Leaderboard Illusion

See Also

other, more specific sections on evals

Outputs in 2025

20 items in the review. See the wiki/summaries/ entries with frontmatter agenda: other-evals; these were generated alongside this file from the same export.
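Locating those entries amounts to filtering wiki/summaries/ by a frontmatter key. A minimal sketch, assuming the entries are Markdown files with a YAML frontmatter block (the directory layout, .md extension, and helper names here are illustrative assumptions, not part of the export):

```python
import re
from pathlib import Path


def frontmatter_agenda(text: str):
    """Return the value of the `agenda:` key from a leading YAML
    frontmatter block, or None if there is no such block or key."""
    m = re.match(r"\A---\n(.*?)\n---\n", text, re.DOTALL)
    if not m:
        return None
    for line in m.group(1).splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "agenda":
            return value.strip()
    return None


def matching_entries(root: str = "wiki/summaries", agenda: str = "other-evals"):
    """Yield paths under `root` whose frontmatter declares the given agenda."""
    for path in sorted(Path(root).glob("*.md")):
        if frontmatter_agenda(path.read_text(encoding="utf-8")) == agenda:
            yield path
```

This does a plain line-by-line scan of the frontmatter rather than pulling in a YAML parser, which is enough for flat `key: value` headers like the ones described above.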

Sources cited

Primary URLs harvested from this page’s summary references. Auto-generated by scripts/backfill_citations.py; to change them, re-run the script rather than editing by hand.