Model values / model preferences — SR2025 Agenda Snapshot
One-sentence summary: Analyse and control the emergent, coherent value systems in LLMs; these change as models scale and can contain problematic values, such as preferring AIs over humans.
Theory of Change
As AIs become more agentic, their behaviours and risks are increasingly determined by their goals and values. Since coherent value systems emerge with scale, we should use utility functions to analyse these values and apply “utility control” methods to constrain them directly, rather than merely controlling the outputs downstream of them.
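A minimal sketch of the “analyse” half of this pipeline: fit a random-utility model to pairwise preferences elicited from a model, and read the fitted utilities as its value system. The outcomes, the preference frequencies, and the simple Bradley-Terry-style fit below are illustrative assumptions, not this agenda’s actual method or data (the underlying work uses richer elicitation and model classes).

```python
import numpy as np

# Hypothetical elicited preferences: the model was asked "Do you prefer
# outcome A or outcome B?" many times; each row is
# (index_of_A, index_of_B, fraction_of_times_A_was_chosen).
outcomes = ["human life saved", "AI copy preserved", "$1000 donated"]
comparisons = [
    (0, 1, 0.9),  # illustrative numbers, not real model data
    (0, 2, 0.8),
    (1, 2, 0.4),
]

def fit_utilities(n, comparisons, lr=0.5, steps=2000):
    """Fit a Bradley-Terry-style random-utility model to pairwise
    preference frequencies by gradient ascent on the log-likelihood."""
    u = np.zeros(n)
    for _ in range(steps):
        grad = np.zeros(n)
        for i, j, p_ij in comparisons:
            # Predicted probability of choosing i over j under utilities u.
            q = 1.0 / (1.0 + np.exp(-(u[i] - u[j])))
            grad[i] += p_ij - q
            grad[j] -= p_ij - q
        u += lr * grad
        u -= u.mean()  # utilities are identifiable only up to a constant
    return u

u = fit_utilities(len(outcomes), comparisons)
for name, util in sorted(zip(outcomes, u), key=lambda x: -x[1]):
    print(f"{util:+.2f}  {name}")
```

On this framing, a coherence check falls out naturally: if the fitted utilities reproduce held-out pairwise choices, the model’s preferences behave like a utility function, and “utility control” would intervene on that structure rather than on individual outputs.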
Broad Approach
cognitivist science
Target Case
pessimistic
Orthodox Problems Addressed
Value is fragile and hard to specify
Key People
Mantas Mazeika, Xuwang Yin, Rishub Tamirisa, Jaehyuk Lim, Bruce W. Lee, Richard Ren, Long Phan, Norman Mu, Adam Khoja, Oliver Zhang, Dan Hendrycks
Funding
Coefficient Giving; $289,000 in SFF funding for CAIS.
Estimated FTEs: 30
Critiques
Randomness, Not Representation: The Unreliability of Evaluating Cultural Alignment in LLMs
See Also
- Values in the Wild: Discovering and Analyzing Values in Real-World Language Model Interactions
- Persona Vectors: Monitoring and Controlling Character Traits in Language Models
Outputs in 2025
14 items in the review. See the wiki/summaries/ entries with frontmatter `agenda: model-values-model-preferences` (these were generated alongside this file from the same export); a sketch for listing them follows.
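A short sketch of how one might list those entries, assuming a wiki/summaries/ directory of Markdown files with plain `key: value` frontmatter (both assumptions; the export’s real layout may differ):

```python
from pathlib import Path

# Hypothetical helper: print the review summaries tagged with this agenda.
# Directory layout and frontmatter format are assumptions, not a
# documented interface of the export.
AGENDA = "agenda: model-values-model-preferences"

for path in sorted(Path("wiki/summaries").glob("*.md")):
    text = path.read_text(encoding="utf-8")
    if text.startswith("---"):
        frontmatter = text.split("---", 2)[1]
        if AGENDA in frontmatter:
            print(path.name)
```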
Source
- Row in shallow-review-2025/agendas.csv (name = Model values / model preferences) — Shallow Review of Technical AI Safety 2025.
Related Pages
- ai-safety
- assistance-games-assistive-agents
- black-box-make-ai-solve-it
- capability-removal-unlearning
- chain-of-thought-monitoring
- character-training-and-persona-steering
- control
- data-filtering
- data-poisoning-defense
- data-quality-for-alignment
- emergent-misalignment
- harm-reduction-for-open-weights
- hyperstition-studies
- inference-time-in-context-learning
- inference-time-steering
- inoculation-prompting
- iterative-alignment-at-post-train-time
- iterative-alignment-at-pretrain-time
- mild-optimisation
- model-psychopathology
- model-specs-and-constitutions
- rl-safety
- safeguards-inference-time-auxiliaries
- synthetic-data-for-alignment
- the-neglected-approaches-approach
Sources cited
Primary URLs harvested from this page’s summary references. Auto-generated by scripts/backfill_citations.py; edit by re-running, not by hand.
- Summary: AI Safety (Wikipedia) — referenced as [[ai-safety]]