Other evals — SR2025 Agenda Snapshot

One-sentence summary: A collection of miscellaneous evaluations targeting specific alignment properties, such as honesty, shutdown resistance, and sycophancy.

Theory of Change

By developing novel benchmarks for specific, hard-to-measure properties (like honesty), critiquing the reliability of existing methods (like cultural surveys), and improving the formal rigor of evaluation systems (like LLM-as-Judges), researchers can create a more robust and comprehensive suite of evaluations to catch failures missed by standard capability or safety testing.

Broad Approach

behaviorist science

Target Case

average

Key People

Richard Ren, Mantas Mazeika, Andrés Corrada-Emmanuel, Ariba Khan, Stephen Casper

Funding

Lab funders (OpenAI), Open Philanthropy (which funds CAIS, the organization behind the MASK benchmark), and academic institutions. No discrete amount is attributable to this agenda; the work falls within the “tens of millions” budgets for broader evaluation and red-teaming efforts at labs and independent organizations.

Estimated FTEs: 20-50

Critiques

The Unreliability of Evaluating Cultural Alignment in LLMs, The Leaderboard Illusion

See Also

other, more specific sections on evals

Outputs in 2025

20 items in the review. See the wiki/summaries/ entries with frontmatter agenda: other-evals; these were generated alongside this file from the same export.
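Locating those entries amounts to filtering wiki/summaries/ by a frontmatter key. A minimal sketch, assuming the entries are Markdown files with a YAML frontmatter block (the directory layout, .md extension, and helper names here are illustrative assumptions, not part of the export):

```python
import re
from pathlib import Path


def frontmatter_agenda(text: str):
    """Return the value of the `agenda:` key from a leading YAML
    frontmatter block, or None if there is no such block or key."""
    m = re.match(r"\A---\n(.*?)\n---\n", text, re.DOTALL)
    if not m:
        return None
    for line in m.group(1).splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "agenda":
            return value.strip()
    return None


def matching_entries(root: str = "wiki/summaries", agenda: str = "other-evals"):
    """Yield paths under `root` whose frontmatter declares the given agenda."""
    for path in sorted(Path(root).glob("*.md")):
        if frontmatter_agenda(path.read_text(encoding="utf-8")) == agenda:
            yield path
```

This does a plain line-by-line scan of the frontmatter rather than pulling in a YAML parser, which is enough for flat `key: value` headers like the ones described above.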

Sources cited

Primary URLs harvested from this page’s summary references. Auto-generated by scripts/backfill_citations.py; to change them, re-run the script rather than editing by hand.