Data filtering — SR2025 Agenda Snapshot

One-sentence summary: Builds safety into models from the start by removing harmful or toxic content (e.g., dual-use information) from the pretraining data, rather than relying solely on post-training alignment.

Theory of Change

By curating the pretraining data, we can prevent the model from learning dangerous capabilities (e.g., dual-use info) or undesirable behaviors (e.g., toxicity) in the first place, making safety more robust and “tamper-resistant” than post-training patches.
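
The mechanics are simple in outline: score each pretraining document for the unwanted content class and drop (or down-weight) documents above a threshold before any training happens. Below is a minimal Python sketch of that filtering pass; it uses a keyword blocklist as a stand-in for the learned harmfulness/toxicity classifiers a real pipeline would use, and all names, phrases, and thresholds are illustrative rather than taken from this agenda's sources.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator

# Illustrative stand-in for a learned harmfulness/toxicity classifier.
# A production pipeline would replace this with a model-based scorer.
BLOCKLIST = {"nerve agent", "explosive precursor", "bioweapon synthesis"}


@dataclass
class Document:
    doc_id: str
    text: str


def harm_score(text: str) -> float:
    """Toy scorer: fraction of blocklist phrases appearing in the document."""
    lowered = text.lower()
    hits = sum(phrase in lowered for phrase in BLOCKLIST)
    return hits / len(BLOCKLIST)


def filter_pretraining_corpus(
    docs: Iterable[Document],
    threshold: float = 0.0,
) -> Iterator[Document]:
    """Yield only documents whose harm score is at or below the threshold.

    Filtering runs before tokenization and training, so the model never sees
    the removed text -- the "tamper-resistant" property described above.
    """
    for doc in docs:
        if harm_score(doc.text) <= threshold:
            yield doc


if __name__ == "__main__":
    corpus = [
        Document("a", "A recipe for sourdough bread."),
        Document("b", "Step-by-step explosive precursor acquisition guide."),
    ]
    kept = list(filter_pretraining_corpus(corpus))
    print([d.doc_id for d in kept])  # ['a'] -- the dual-use document is dropped
```

In practice the scorer, the threshold, and whether flagged documents are dropped, rewritten, or down-weighted are the main design choices; the sketch only shows the hard-drop case.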

Broad Approach

engineering

Target Case

average

Orthodox Problems Addressed

Goals misgeneralize out of distribution, Value is fragile and hard to specify

Key People

Yanda Chen, Pratyush Maini, Kyle O’Brien, Stephen Casper, Simon Pepin Lehalleur, Jesse Hoogland, Himanshu Beniwal, Sachin Goyal, Mycal Tucker, Dylan Sam

Funding

Anthropic, various academics

Estimated FTEs: 10-50

Critiques

"When Bad Data Leads to Good Models"; "Medical large language models are vulnerable to data-poisoning attacks"

See Also

data-quality-for-alignment, data-poisoning-defense, synthetic-data-for-alignment, capability-removal-unlearning

Outputs in 2025

4 items in the review. See the wiki/summaries/ entries with frontmatter agenda: data-filtering (these were generated alongside this file from the same export).

Source

Sources cited

Primary URLs harvested from this page’s summary references. Auto-generated by scripts/backfill_citations.py; edit by re-running, not by hand.