Deep Ignorance: Filtering Pretraining Data Builds Tamper-Resistant Safeguards into Open-Weight LLMs
Kyle O’Brien, Stephen Casper, Quentin Anthony, Tomek Korbak, Robert Kirk, Xander Davies, … (+4 more) — 2025-08-08 — Anthropic, Redwood Research, EleutherAI, University of Oxford — arXiv
Summary
Introduces a scalable pipeline for filtering dual-use (biothreat-related) content from pretraining data and demonstrates that models trained on the filtered corpus are substantially more resistant to adversarial fine-tuning attacks than models protected only by post-training safeguards.
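A minimal sketch of what such a filtering pipeline could look like, assuming a cheap blocklist pass followed by a scoring stage; all names, terms, and thresholds below are illustrative and are not the paper's actual pipeline:

```python
# Hypothetical multi-stage pretraining-data filter (illustrative only;
# not the pipeline from the paper).

BLOCKLIST = {"toxin synthesis", "pathogen enhancement"}  # illustrative terms

def cheap_blocklist_pass(doc: str) -> bool:
    """Stage 1: fast substring match flags candidate documents."""
    text = doc.lower()
    return any(term in text for term in BLOCKLIST)

def risk_score(doc: str) -> float:
    """Stage 2 stand-in: a real pipeline would likely call an ML classifier.
    This toy scorer counts flagged terms per 100 words."""
    text = doc.lower()
    words = text.split()
    hits = sum(1 for term in BLOCKLIST if term in text)
    return hits / max(len(words), 1) * 100

def keep_document(doc: str, threshold: float = 0.5) -> bool:
    """Drop a document only if stage 1 flags it AND stage 2 scores it above threshold."""
    if not cheap_blocklist_pass(doc):
        return True
    return risk_score(doc) <= threshold

corpus = [
    "A recipe for sourdough bread.",
    "Notes on pathogen enhancement and toxin synthesis protocols.",
]
filtered = [d for d in corpus if keep_document(d)]  # keeps only the first document
```

The two-stage structure is a common pattern for corpus-scale filtering: the cheap pass keeps the expensive scorer off the vast majority of benign documents.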
Key Result
Models pretrained on filtered data resisted adversarial fine-tuning for up to 10,000 steps on 300M tokens of biothreat-related text, outperforming existing post-training baselines by over an order of magnitude, though they can still leverage dangerous information provided in context.
Source
- Link: https://arxiv.org/abs/2508.06601
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- data-filtering — Black-box safety (understand and control current model behaviour) / Better data