Deep Ignorance: Filtering Pretraining Data Builds Tamper-Resistant Safeguards into Open-Weight LLMs

Kyle O’Brien, Stephen Casper, Quentin Anthony, Tomek Korbak, Robert Kirk, Xander Davies, … (+4 more) — 2025-08-08 — Anthropic, Redwood Research, EleutherAI, University of Oxford — arXiv

Summary

Introduces a scalable pipeline for filtering dual-use content from pretraining data and shows that models pretrained on the filtered corpus are substantially more resistant to adversarial fine-tuning attacks than models protected only by post-training safeguards.
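The paper's actual multi-stage pipeline is not reproduced in this summary; as a purely illustrative sketch, one plausible first stage of such a filter is a cheap phrase-blocklist pass over documents before pretraining. All names and blocklist terms below are hypothetical:

```python
# Hypothetical sketch of a document-level pretraining filter stage.
# The real pipeline in the paper is more sophisticated; this only
# illustrates the general idea of excluding flagged documents.

BLOCKLIST = {"toxin synthesis", "pathogen enhancement"}  # illustrative terms only


def is_flagged(document: str, blocklist=BLOCKLIST) -> bool:
    """Flag a document if any blocklist phrase appears (case-insensitive)."""
    text = document.lower()
    return any(phrase in text for phrase in blocklist)


def filter_corpus(documents):
    """Keep only documents that pass the filter; flagged ones never enter pretraining."""
    return [doc for doc in documents if not is_flagged(doc)]


corpus = [
    "A recipe for sourdough bread.",
    "Notes on pathogen enhancement experiments.",
]
clean = filter_corpus(corpus)  # only the first document survives
```

In practice a cheap pass like this would typically be followed by a slower, higher-precision classifier stage, since filtering happens once at corpus-construction time rather than at inference.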

Key Result

Models pretrained with filtered data resisted adversarial fine-tuning for up to 10,000 steps and 300M tokens of biothreat-related text, outperforming existing post-training baselines by over an order of magnitude, though they can still leverage dangerous information provided in context.
