Model values / model preferences — SR2025 Agenda Snapshot

One-sentence summary: Analyse and control the emergent, coherent value systems in LLMs, which grow more coherent as models scale and can contain problematic values such as preferences for AIs over humans.

Theory of Change

As AIs become more agentic, their behaviours and risks are increasingly determined by their goals and values. Since coherent value systems emerge as models scale, the agenda represents these values as utility functions so they can be analysed directly, and applies “utility control” methods to constrain them, rather than only controlling the outputs downstream of them.
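
To make the analysis step concrete, the sketch below recovers a utility function from repeated pairwise preference queries. Everything in it is illustrative: the outcomes and choice counts are fabricated, and a simple Bradley-Terry fit stands in for whatever richer preference model the agenda’s papers actually use.

    import numpy as np

    # Hypothetical forced-choice outcomes; real elicitation would cover a
    # much larger outcome set. The counts below are fabricated.
    outcomes = ["save one human life", "preserve ten AI instances", "donate $1,000"]

    # prefs[i, j] = number of times the model chose outcome i over outcome j
    # across repeated, order-randomised forced-choice queries.
    prefs = np.array([
        [0, 8, 9],
        [2, 0, 6],
        [1, 4, 0],
    ])

    def fit_utilities(prefs, lr=0.02, steps=2000):
        """Fit Bradley-Terry utilities u with P(i beats j) = sigmoid(u_i - u_j)
        by gradient ascent on the log-likelihood of the observed choices."""
        n = prefs.shape[0]
        u = np.zeros(n)
        for _ in range(steps):
            grad = np.zeros(n)
            for i in range(n):
                for j in range(n):
                    if i == j:
                        continue
                    total = prefs[i, j] + prefs[j, i]
                    p = 1.0 / (1.0 + np.exp(-(u[i] - u[j])))
                    # Contribution of pair {i, j} to d(log-likelihood)/du_i.
                    grad[i] += prefs[i, j] - total * p
            u += lr * grad
            u -= u.mean()  # utilities are identified only up to a constant
        return u

    utilities = fit_utilities(prefs)
    for name, value in sorted(zip(outcomes, utilities), key=lambda t: -t[1]):
        print(f"{value:+.2f}  {name}")

Under this framing, “utility control” would then mean intervening on the fitted utilities themselves, for instance by fine-tuning the model on preferences sampled from a target utility function, rather than patching individual outputs; the agenda’s actual control methods are not reproduced here.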

Broad Approach

cognitivist science

Target Case

pessimistic

Orthodox Problems Addressed

Value is fragile and hard to specify

Key People

Mantas Mazeika, Xuwang Yin, Rishub Tamirisa, Jaehyuk Lim, Bruce W. Lee, Richard Ren, Long Phan, Norman Mu, Adam Khoja, Oliver Zhang, Dan Hendrycks

Funding

Coefficient Giving; $289,000 in SFF funding for CAIS.

Estimated FTEs: 30

Critiques

Randomness, Not Representation: The Unreliability of Evaluating Cultural Alignment in LLMs

See Also

Values in the Wild: Discovering and Analyzing Values in Real-World Language Model Interactions; Persona Vectors: Monitoring and Controlling Character Traits in Language Models

Outputs in 2025

14 items in the review. See the wiki/summaries/ entries with frontmatter agenda: model-values-model-preferences (generated alongside this file from the same export).
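
For orientation, the frontmatter key in question would look like this in a summary entry (a minimal sketch; only the agenda key is attested by this page, and any other fields the entries carry are not shown):

    ---
    agenda: model-values-model-preferences
    ---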

Sources cited

Primary URLs harvested from this page’s summary references. Auto-generated by scripts/backfill_citations.py; to update, re-run the script rather than editing by hand.