Deep ignorance: Filtering pretraining data builds tamper-resistant safeguards into open-weight LLMs
Kyle O’Brien, Stephen Casper, Quentin Anthony, Tomek Korbak, Robert Kirk, Xander Davies, … (+4 more) — 2025-08-08 — UK AI Security Institute, MIT, EleutherAI
Source
- Link: https://www.aisi.gov.uk/research/deep-ignorance-filtering-pretraining-data-builds-tamper-resistant-safeguards-into-open-weight-llms
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- harm-reduction-for-open-weights — Black-box safety (understand and control current model behaviour) / Goal robustness