Enhancing Model Safety through Pretraining Data Filtering
Yanda Chen, Mycal Tucker, Nina Panickssery, Tony Wang, Francesco Mosconi, Anjali Gopal, … (+5 more) — 2025-08-19 — Anthropic — Anthropic Alignment Science Blog
Summary
Develops and evaluates a pretraining data filtering approach that uses classifiers to identify and remove harmful CBRN weapons information from training data, then pretrains models from scratch on the filtered datasets, reducing dangerous capabilities while preserving useful ones.
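The filtering step described above can be sketched as follows. This is a minimal illustration, not the paper's pipeline: the real approach uses trained harmfulness classifiers, and the `harm_score` keyword scorer, the keyword list, and the threshold below are all hypothetical stand-ins.

```python
def harm_score(doc: str, keywords=("nerve agent", "pathogen synthesis")) -> float:
    """Toy stand-in for a trained harmfulness classifier: returns the
    fraction of flagged keywords that appear in the document."""
    text = doc.lower()
    hits = sum(1 for k in keywords if k in text)
    return hits / len(keywords)

def filter_corpus(docs, threshold=0.5):
    """Keep only documents the classifier scores below the threshold;
    the surviving documents form the filtered pretraining set."""
    return [d for d in docs if harm_score(d) < threshold]

corpus = [
    "A tutorial on sorting algorithms.",
    "Steps for pathogen synthesis and nerve agent production.",
]
filtered = filter_corpus(corpus)  # only the benign document survives
```

A model would then be pretrained from scratch on `filtered` rather than `corpus`; the design question is tuning the classifier and threshold so that harmful material is removed without discarding benign technical text.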
Key Result
Filtering reduced performance on harmful CBRN capability evaluations by 33% of the margin above the random-chance baseline (accuracy fell from 33.7±0.4% to 30.8±0.4%), while causing no significant drop in harmless capabilities, including MMLU, Code, and Prose.
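To make the "33% relative to random baseline" arithmetic concrete: the reduction is measured on the margin above chance, not on raw accuracy. The 25% chance level below is an assumption (e.g. four-option multiple choice), not stated here, but it is consistent with the quoted figures.

```python
# Assumed random-guessing accuracy (%) -- hypothetical, inferred from the
# quoted 33% figure rather than stated in this summary.
chance = 25.0
before, after = 33.7, 30.8          # accuracy before/after filtering (%)

margin_before = before - chance     # 8.7 points above chance
margin_after = after - chance       # 5.8 points above chance
relative_reduction = (margin_before - margin_after) / margin_before
# relative_reduction is about 0.33, matching the reported 33%
```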
Source
- Link: https://alignment.anthropic.com/2025/pretraining-data-filtering/
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- data-filtering — Black-box safety (understand and control current model behaviour) / Better data