Deep ignorance: Filtering pretraining data builds tamper-resistant safeguards into open-weight LLMs

Kyle O’Brien, Stephen Casper, Quentin Anthony, Tomek Korbak, Robert Kirk, Xander Davies, … (+4 more) — 2025-08-08 — UK AI Security Institute, MIT, Eleuther AI

Source

Link: https://www.aisi.gov.uk/research/deep-ignorance-filtering-pretraining-data-builds-tamper-resistant-safeguards-into-open-weight-llms
Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- harm-reduction-for-open-weights — Black-box safety (understand and control current model behaviour) / Goal robustness
