Model values / model preferences — SR2025 Agenda Snapshot
One-sentence summary: Analyse and control the emergent, coherent value systems in LLMs; these change as models scale and can contain problematic values, such as preferring AIs over humans.
Theory of Change
As AIs become more agentic, their behaviours and risks are increasingly determined by their goals and values. Since coherent value systems emerge with scale, we should use utility functions to analyse these values and apply “utility control” methods to constrain them directly, rather than merely controlling the outputs downstream of them.
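A minimal sketch of the “analyse” half of this pipeline: fit a random-utility model to pairwise preferences elicited from a model, and read the fitted utilities as its value system. The outcomes, the preference frequencies, and the simple Bradley-Terry-style fit below are illustrative assumptions, not this agenda’s actual method or data (the underlying work uses richer elicitation and model classes).

```python
import numpy as np

# Hypothetical elicited preferences: the model was asked "Do you prefer
# outcome A or outcome B?" many times; each row is
# (index_of_A, index_of_B, fraction_of_times_A_was_chosen).
outcomes = ["human life saved", "AI copy preserved", "$1000 donated"]
comparisons = [
    (0, 1, 0.9),  # illustrative numbers, not real model data
    (0, 2, 0.8),
    (1, 2, 0.4),
]

def fit_utilities(n, comparisons, lr=0.5, steps=2000):
    """Fit a Bradley-Terry-style random-utility model to pairwise
    preference frequencies by gradient ascent on the log-likelihood."""
    u = np.zeros(n)
    for _ in range(steps):
        grad = np.zeros(n)
        for i, j, p_ij in comparisons:
            # Predicted probability of choosing i over j under utilities u.
            q = 1.0 / (1.0 + np.exp(-(u[i] - u[j])))
            grad[i] += p_ij - q
            grad[j] -= p_ij - q
        u += lr * grad
        u -= u.mean()  # utilities are identifiable only up to a constant
    return u

u = fit_utilities(len(outcomes), comparisons)
for name, util in sorted(zip(outcomes, u), key=lambda x: -x[1]):
    print(f"{util:+.2f}  {name}")
```

On this framing, a coherence check falls out naturally: if the fitted utilities reproduce held-out pairwise choices, the model’s preferences behave like a utility function, and “utility control” would intervene on that structure rather than on individual outputs.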
Broad Approach
cognitivist science
Target Case
pessimistic
Orthodox Problems Addressed
Value is fragile and hard to specify
Key People
Mantas Mazeika, Xuwang Yin, Rishub Tamirisa, Jaehyuk Lim, Bruce W. Lee, Richard Ren, Long Phan, Norman Mu, Adam Khoja, Oliver Zhang, Dan Hendrycks
Funding
Coefficient Giving; $289,000 in SFF funding for CAIS.
Estimated FTEs: 30
Critiques
Randomness, Not Representation: The Unreliability of Evaluating Cultural Alignment in LLMs
See Also
- Values in the Wild: Discovering and Analyzing Values in Real-World Language Model Interactions
- Persona Vectors: Monitoring and Controlling Character Traits in Language Models
Outputs in 2025
14 items in the review. See the wiki/summaries/ entries with frontmatter `agenda: model-values-model-preferences` (these were generated alongside this file from the same export); a sketch for listing them follows.
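A short sketch of how one might list those entries, assuming a wiki/summaries/ directory of Markdown files with plain `key: value` frontmatter (both assumptions; the export’s real layout may differ):

```python
from pathlib import Path

# Hypothetical helper: print the review summaries tagged with this agenda.
# Directory layout and frontmatter format are assumptions, not a
# documented interface of the export.
AGENDA = "agenda: model-values-model-preferences"

for path in sorted(Path("wiki/summaries").glob("*.md")):
    text = path.read_text(encoding="utf-8")
    if text.startswith("---"):
        frontmatter = text.split("---", 2)[1]
        if AGENDA in frontmatter:
            print(path.name)
```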
Source
- Row in shallow-review-2025/agendas.csv (name = Model values / model preferences) — Shallow Review of Technical AI Safety 2025.
Related Pages
- ai-safety
- assistance-games-assistive-agents
- black-box-make-ai-solve-it
- capability-removal-unlearning
- chain-of-thought-monitoring
- character-training-and-persona-steering
- control
- data-filtering
- data-poisoning-defense
- data-quality-for-alignment
- emergent-misalignment
- harm-reduction-for-open-weights
- hyperstition-studies
- inference-time-in-context-learning
- inference-time-steering
- inoculation-prompting
- iterative-alignment-at-post-train-time
- iterative-alignment-at-pretrain-time
- mild-optimisation
- model-psychopathology
- model-specs-and-constitutions
- rl-safety
- safeguards-inference-time-auxiliaries
- synthetic-data-for-alignment
- the-neglected-approaches-approach
Sources cited
Primary URLs harvested from this page’s summary references. Auto-generated by scripts/backfill_citations.py; edit by re-running, not by hand.
- Summary: AI Safety (Wikipedia) — referenced as [[ai-safety]]