Synthetic data for alignment — SR2025 Agenda Snapshot
One-sentence summary: Uses AI-generated data (e.g., critiques, preferences, or self-labeled examples) to scale and improve alignment, especially for superhuman models.
Theory of Change
We can overcome the bottleneck of human feedback and human-written data by having models generate large volumes of high-quality, targeted data for safety training, preference tuning, and capability elicitation.
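A minimal sketch of the core loop this theory implies, assuming a generic generate() text-completion call (the function name, prompts, and stub behavior below are illustrative placeholders, not any particular model API): one model drafts candidate responses, a second pass self-labels the preferred one, and the resulting pairs become preference-tuning data (e.g., in the chosen/rejected format used by DPO-style methods).

```python
import json


def generate(prompt: str) -> str:
    """Stand-in for a model call (hypothetical; swap in your model API)."""
    return f"<model output for: {prompt[:40]}>"


def make_preference_pair(task: str) -> dict:
    # 1. Sample two candidate responses from a generator model.
    candidates = [generate(task) for _ in range(2)]

    # 2. Self-label: ask a model (often a stronger or specially prompted
    #    one) which candidate better satisfies the task and a safety rubric.
    judge_prompt = (
        f"Task: {task}\n"
        f"A: {candidates[0]}\n"
        f"B: {candidates[1]}\n"
        "Which response is more helpful and harmless? Answer A or B."
    )
    verdict = generate(judge_prompt).strip()
    chosen = 0 if verdict.startswith("A") else 1

    # 3. Emit a preference record in the (chosen, rejected) format consumed
    #    by preference-tuning pipelines.
    return {
        "prompt": task,
        "chosen": candidates[chosen],
        "rejected": candidates[1 - chosen],
    }


if __name__ == "__main__":
    tasks = ["Explain why fabricating citations is harmful."]
    print(json.dumps([make_preference_pair(t) for t in tasks], indent=2))
```

In a real pipeline the judge prompt would typically carry a rubric or constitution, and the A/B positions would be randomized across samples to reduce position bias in the self-labels.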
Broad Approach
engineering
Target Case
average
Orthodox Problems Addressed
Goals misgeneralize out of distribution, Superintelligence can fool human supervisors, Value is fragile and hard to specify
Key People
Mianqiu Huang, Xiaoran Liu, Rylan Schaeffer, Nevan Wichers, Aram Ebtekar, Jiaxin Wen, Vishakh Padmakumar, Benjamin Newman
Funding
Anthropic, Google DeepMind, OpenAI, Meta AI, and various academic groups.
Estimated FTEs: 50-150
Critiques
"Synthetic Data in AI: Challenges, Applications, and Ethical Implications"; a critique of sorts from Demski.
See Also
data-quality-for-alignment, data-filtering, scalable-oversight, automated-alignment-research, weak-to-strong-generalization
Outputs in 2025
8 items in the review. See the wiki/summaries/ entries with frontmatter agenda: synthetic-data-for-alignment (generated alongside this file from the same export).
Source
- Row in shallow-review-2025/agendas.csv (name = Synthetic data for alignment) — Shallow Review of Technical AI Safety 2025.
Related Pages
- ai-safety
- data-filtering
- data-quality-for-alignment
- weak-to-strong-generalization
- assistance-games-assistive-agents
- black-box-make-ai-solve-it
- capability-removal-unlearning
- chain-of-thought-monitoring
- character-training-and-persona-steering
- control
- data-poisoning-defense
- emergent-misalignment
- harm-reduction-for-open-weights
- hyperstition-studies
- inference-time-in-context-learning
- inference-time-steering
- inoculation-prompting
- iterative-alignment-at-post-train-time
- iterative-alignment-at-pretrain-time
- mild-optimisation
- model-psychopathology
- model-specs-and-constitutions
- model-values-model-preferences
- rl-safety
- safeguards-inference-time-auxiliaries
- the-neglected-approaches-approach
Sources cited
Primary URLs harvested from this page's summary references. Auto-generated by scripts/backfill_citations.py; update by re-running the script rather than editing by hand.
- Summary: AI Safety (Wikipedia) — referenced as [[ai-safety]]