Data filtering — SR2025 Agenda Snapshot

One-sentence summary: Builds safety into models from the start by removing harmful or toxic content (e.g., dual-use information) from the pretraining data, rather than relying solely on post-training alignment.

Theory of Change

By curating the pretraining data, we can prevent the model from learning dangerous capabilities (e.g., dual-use info) or undesirable behaviors (e.g., toxicity) in the first place, making safety more robust and “tamper-resistant” than post-training patches.
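
The mechanics are simple in outline: score each pretraining document for the unwanted content class and drop (or down-weight) documents above a threshold before any training happens. Below is a minimal Python sketch of that filtering pass; it uses a keyword blocklist as a stand-in for the learned harmfulness/toxicity classifiers a real pipeline would use, and all names, phrases, and thresholds are illustrative rather than taken from this agenda's sources.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator

# Illustrative stand-in for a learned harmfulness/toxicity classifier.
# A production pipeline would replace this with a model-based scorer.
BLOCKLIST = {"nerve agent", "explosive precursor", "bioweapon synthesis"}


@dataclass
class Document:
    doc_id: str
    text: str


def harm_score(text: str) -> float:
    """Toy scorer: fraction of blocklist phrases appearing in the document."""
    lowered = text.lower()
    hits = sum(phrase in lowered for phrase in BLOCKLIST)
    return hits / len(BLOCKLIST)


def filter_pretraining_corpus(
    docs: Iterable[Document],
    threshold: float = 0.0,
) -> Iterator[Document]:
    """Yield only documents whose harm score is at or below the threshold.

    Filtering runs before tokenization and training, so the model never sees
    the removed text -- the "tamper-resistant" property described above.
    """
    for doc in docs:
        if harm_score(doc.text) <= threshold:
            yield doc


if __name__ == "__main__":
    corpus = [
        Document("a", "A recipe for sourdough bread."),
        Document("b", "Step-by-step explosive precursor acquisition guide."),
    ]
    kept = list(filter_pretraining_corpus(corpus))
    print([d.doc_id for d in kept])  # ['a'] -- the dual-use document is dropped
```

In practice the scorer, the threshold, and whether flagged documents are dropped, rewritten, or down-weighted are the main design choices; the sketch only shows the hard-drop case.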

Broad Approach

engineering

Target Case

average

Orthodox Problems Addressed

Goals misgeneralize out of distribution, Value is fragile and hard to specify

Key People

Yanda Chen, Pratyush Maini, Kyle O’Brien, Stephen Casper, Simon Pepin Lehalleur, Jesse Hoogland, Himanshu Beniwal, Sachin Goyal, Mycal Tucker, Dylan Sam

Funding

Anthropic, various academics

Estimated FTEs: 10-50

Critiques

"When Bad Data Leads to Good Models"; "Medical large language models are vulnerable to data-poisoning attacks"

See Also

data-quality-for-alignment, data-poisoning-defense, synthetic-data-for-alignment, capability-removal-unlearning

Outputs in 2025

4 items in the review. See the wiki/summaries/ entries with frontmatter agenda: data-filtering (these were generated alongside this file from the same export).

Source

Sources cited

Primary URLs harvested from this page’s summary references. Auto-generated by scripts/backfill_citations.py; edit by re-running, not by hand.