Enhancing Model Safety through Pretraining Data Filtering

Yanda Chen, Mycal Tucker, Nina Panickssery, Tony Wang, Francesco Mosconi, Anjali Gopal, … (+5 more) — 2025-08-19 — Anthropic — Anthropic Alignment Science Blog

Summary

Develops and evaluates a pretraining data filtering approach that uses classifiers to identify and remove harmful CBRN weapons information from the training corpus, then pretrains models from scratch on the filtered data to reduce dangerous capabilities while preserving useful ones.
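The filtering step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the scoring function, keyword heuristic, and threshold are all hypothetical stand-ins for the (unspecified) harmfulness classifiers the post describes.

```python
# Illustrative sketch of classifier-based pretraining data filtering.
# harmfulness_score is a placeholder heuristic; the actual work uses
# trained classifiers whose details are not given in this summary.

def harmfulness_score(document: str) -> float:
    """Hypothetical classifier: probability that a document contains
    harmful CBRN weapons information (keyword heuristic for demo only)."""
    flagged_terms = {"nerve agent synthesis", "weaponization protocol"}
    hits = sum(term in document.lower() for term in flagged_terms)
    return min(1.0, hits / len(flagged_terms))

def filter_corpus(documents: list[str], threshold: float = 0.5) -> list[str]:
    """Keep only documents scored below the threshold; the filtered
    corpus is then used to pretrain a model from scratch."""
    return [doc for doc in documents if harmfulness_score(doc) < threshold]

corpus = [
    "A history of the periodic table.",
    "Weaponization protocol and nerve agent synthesis notes.",
]
filtered = filter_corpus(corpus)  # only the benign document survives
```

In practice the interesting design question is the threshold: filtering too aggressively removes benign scientific text and hurts general capabilities, which is why the post evaluates both harmful and harmless benchmarks.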

Key Result

Reduced accuracy on a harmful CBRN capability evaluation from 33.7±0.4% to 30.8±0.4%, a 33% reduction in performance relative to the random baseline, with no significant drop in harmless capabilities (MMLU, Code, and Prose).
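The "33% relative to random baseline" figure is consistent with measuring the reduction as a fraction of the margin above chance. A quick check, assuming a 25% random-chance baseline (an assumption; the summary does not state the evaluation's chance level):

```python
# Relative-reduction arithmetic for the headline result, assuming a
# four-way multiple-choice evaluation with a 25% random baseline.
random_baseline = 0.25  # assumed chance level, not stated in the summary
before, after = 0.337, 0.308

# Fraction of the above-chance margin removed by filtering.
reduction = (before - after) / (before - random_baseline)
print(f"{reduction:.0%}")  # → 33%
```

Under this reading, filtering removed 2.9 points of the 8.7-point margin the unfiltered model held over random guessing.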

Source