Synthetic data for alignment — SR2025 Agenda Snapshot

One-sentence summary: Uses AI-generated data (e.g., critiques, preferences, or self-labeled examples) to scale and improve alignment, especially for superhuman models.

Theory of Change

We can overcome the bottleneck of scarce human feedback and data by using models to generate vast amounts of high-quality, targeted data for safety training, preference tuning, and capability elicitation.
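
As a concrete illustration, the sketch below shows one common instantiation of this idea: an RLAIF-style loop in which a policy model samples candidate responses and a second model labels which one is preferred, producing preference pairs with no human annotators in the loop. The generate and judge callables (and the toy stubs) are hypothetical placeholders for this sketch, not any specific lab's API.

```python
import json
import random
from typing import Callable


def make_preference_pairs(
    prompts: list[str],
    generate: Callable[[str], str],         # policy model: prompt -> response
    judge: Callable[[str, str, str], int],  # AI judge: (prompt, a, b) -> 0 if a is preferred, else 1
) -> list[dict]:
    """Sample two responses per prompt and let an AI judge pick the
    better one, yielding (chosen, rejected) pairs for preference tuning."""
    pairs = []
    for prompt in prompts:
        a, b = generate(prompt), generate(prompt)
        winner = judge(prompt, a, b)
        chosen, rejected = (a, b) if winner == 0 else (b, a)
        pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs


if __name__ == "__main__":
    # Toy stubs so the sketch runs end to end; real use would call an LLM.
    rng = random.Random(0)
    toy_generate = lambda p: f"{p} -> draft answer v{rng.randint(1, 9)}"
    toy_judge = lambda p, a, b: int(len(b) > len(a))  # stand-in heuristic, not a real judge
    dataset = make_preference_pairs(["Explain RLHF in one sentence."], toy_generate, toy_judge)
    print(json.dumps(dataset, indent=2))
```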

Broad Approach

engineering

Target Case

average

Orthodox Problems Addressed

Goals misgeneralize out of distribution, Superintelligence can fool human supervisors, Value is fragile and hard to specify

Key People

Mianqiu Huang, Xiaoran Liu, Rylan Schaeffer, Nevan Wichers, Aram Ebtekar, Jiaxin Wen, Vishakh Padmakumar, Benjamin Newman

Funding

Anthropic, Google DeepMind, OpenAI, Meta AI, various academic groups.

Estimated FTEs: 50-150

Critiques

Synthetic Data in AI: Challenges, Applications, and Ethical Implications; Demski (sort of).

See Also

data-quality-for-alignment, data-filtering, scalable-oversight, automated-alignment-research, weak-to-strong-generalization

Outputs in 2025

8 items in the review. See the wiki/summaries/ entries with frontmatter agenda: synthetic-data-for-alignment (generated alongside this file from the same export).

Sources cited

Primary URLs harvested from this page’s summary references. Auto-generated by scripts/backfill_citations.py; edit by re-running, not by hand.