Deep Ignorance: Filtering Pretraining Data Builds Tamper-Resistant Safeguards into Open-Weight LLMs
Kyle O’Brien, Stephen Casper, Quentin Anthony, Tomek Korbak, Robert Kirk, Xander Davies, … (+4 more) — 2025-08-08 — Anthropic, Redwood Research, EleutherAI, University of Oxford — arXiv
Summary
Introduces a scalable pipeline for filtering dual-use (biothreat-related) content from pretraining data and demonstrates that models trained on the filtered corpus are substantially more resistant to adversarial fine-tuning attacks than models protected only by post-training safeguards.
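A minimal sketch of what such a filtering pipeline could look like, assuming a cheap blocklist pass followed by a scoring stage; all names, terms, and thresholds below are illustrative and are not the paper's actual pipeline:

```python
# Hypothetical multi-stage pretraining-data filter (illustrative only;
# not the pipeline from the paper).

BLOCKLIST = {"toxin synthesis", "pathogen enhancement"}  # illustrative terms

def cheap_blocklist_pass(doc: str) -> bool:
    """Stage 1: fast substring match flags candidate documents."""
    text = doc.lower()
    return any(term in text for term in BLOCKLIST)

def risk_score(doc: str) -> float:
    """Stage 2 stand-in: a real pipeline would likely call an ML classifier.
    This toy scorer counts flagged terms per 100 words."""
    text = doc.lower()
    words = text.split()
    hits = sum(1 for term in BLOCKLIST if term in text)
    return hits / max(len(words), 1) * 100

def keep_document(doc: str, threshold: float = 0.5) -> bool:
    """Drop a document only if stage 1 flags it AND stage 2 scores it above threshold."""
    if not cheap_blocklist_pass(doc):
        return True
    return risk_score(doc) <= threshold

corpus = [
    "A recipe for sourdough bread.",
    "Notes on pathogen enhancement and toxin synthesis protocols.",
]
filtered = [d for d in corpus if keep_document(d)]  # keeps only the first document
```

The two-stage structure is a common pattern for corpus-scale filtering: the cheap pass keeps the expensive scorer off the vast majority of benign documents.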
Key Result
Models pretrained on filtered data resisted adversarial fine-tuning for up to 10,000 steps on 300M tokens of biothreat-related text, outperforming existing post-training baselines by over an order of magnitude, though they can still leverage dangerous information provided in context.
Source
- Link: https://arxiv.org/abs/2508.06601
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- data-filtering — Black-box safety (understand and control current model behaviour) / Better data