Data quality for alignment — SR2025 Agenda Snapshot
One-sentence summary: Improves the quality, signal-to-noise ratio, and reliability of human-generated preference and alignment data.
Theory of Change
The quality of alignment is heavily dependent on the quality of the data (e.g., human preferences); by improving the “signal” from annotators and reducing noise/bias, we will get more robustly aligned models.
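As a loose illustration of the "reduce annotator noise" idea, the sketch below filters preference comparisons by inter-annotator agreement before they would feed into reward-model training. The data, field names, and agreement threshold are hypothetical and not drawn from any of the agenda's papers; this is a minimal sketch of one quality-control step, not a reference implementation.

```python
from collections import Counter

# Hypothetical preference records: several annotators mark which of two
# responses (A or B) they prefer for the same prompt.
records = [
    {"prompt_id": 1, "labels": ["A", "A", "A", "B"]},
    {"prompt_id": 2, "labels": ["A", "B", "A", "B"]},  # split vote: likely noise
    {"prompt_id": 3, "labels": ["B", "B", "B", "B"]},
]

def agreement(labels):
    """Fraction of annotators voting for the majority choice (1.0 = unanimous)."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

# Keep only comparisons where annotators mostly agree; low-agreement items are
# treated as noisy labels and dropped (or routed back for re-labelling).
THRESHOLD = 0.75  # illustrative cutoff, not a recommended value
clean = [r for r in records if agreement(r["labels"]) >= THRESHOLD]

for r in clean:
    majority = Counter(r["labels"]).most_common(1)[0][0]
    print(f"prompt {r['prompt_id']}: kept, majority preference = {majority}")
```

In practice the same filter-or-relabel decision would use richer signals (annotator track records, gold questions, calibration), but thresholding on simple agreement already captures the core trade-off between data volume and label reliability.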
Broad Approach
engineering
Target Case
average
Orthodox Problems Addressed
Superintelligence can fool human supervisors, Value is fragile and hard to specify
Key People
Maarten Buyl, Kelsey Kraus, Margaret Kroll, Danqing Shi
Funding
Anthropic, Google DeepMind, OpenAI, Meta AI, various academic groups
Estimated FTEs: 20-50
Critiques
A Statistical Case Against Empirical Human-AI Alignment
See Also
synthetic-data-for-alignment, scalable-oversight, assistance-games-assistive-agents, model-values-model-preferences
Outputs in 2025
5 items in the review. See the wiki/summaries/ entries with frontmatter agenda: data-quality-for-alignment (these were generated alongside this file from the same export).
Source
- Row in shallow-review-2025/agendas.csv (name = Data quality for alignment) — Shallow Review of Technical AI Safety 2025.
Related Pages
- ai-safety
- assistance-games-assistive-agents
- model-values-model-preferences
- synthetic-data-for-alignment
- black-box-make-ai-solve-it
- capability-removal-unlearning
- chain-of-thought-monitoring
- character-training-and-persona-steering
- control
- data-filtering
- data-poisoning-defense
- emergent-misalignment
- harm-reduction-for-open-weights
- hyperstition-studies
- inference-time-in-context-learning
- inference-time-steering
- inoculation-prompting
- iterative-alignment-at-post-train-time
- iterative-alignment-at-pretrain-time
- mild-optimisation
- model-psychopathology
- model-specs-and-constitutions
- rl-safety
- safeguards-inference-time-auxiliaries
- the-neglected-approaches-approach
- data-attribution
Sources cited
Primary URLs harvested from this page’s summary references. Auto-generated by scripts/backfill_citations.py; edit by re-running, not by hand.
- Summary: AI Safety (Wikipedia) — referenced as [[ai-safety]]