Distillation Robustifies Unlearning
Bruce W. Lee, Addie Foote, Alex Infanger, Leni Shor, Harish Kamath, Jacob Goldman-Wetzler, … (+3 more) — 2025-06-06 — arXiv (NeurIPS 2025 Spotlight)
Summary
Proposes UNDO (Unlearn-Noise-Distill-on-Outputs), a method that first applies a standard unlearning procedure, then perturbs the model's weights with noise, and finally distills the unlearned teacher into this noised copy on the teacher's own outputs. The result is capability removal that is robust to finetuning attacks, matching the robustness of retraining from scratch at a fraction of the compute.
Key Result
On the WMDP benchmark, UNDO achieves robustness comparable to retraining from scratch with perfect data filtering, while requiring only 60-80% of the compute and 0.01% of the labeled pretraining data.
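The three-stage pipeline can be illustrated with a toy sketch. This is not the paper's implementation: the linear map standing in for an LLM, the noise-mixing coefficient `alpha`, and the plain MSE distillation loss are all simplifying assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy model: a linear map y = W @ x stands in for an LLM.
# Stage 1 (Unlearn): assume W_teacher is the already-unlearned model.
W_teacher = rng.normal(size=(4, 8))

# Stage 2 (Noise): mix noise into a copy of the teacher's weights,
# damaging retained skills and latent forget-set capability alike.
alpha = 0.5  # noise strength (hypothetical hyperparameter)
W_student = (1 - alpha) * W_teacher + alpha * rng.normal(size=W_teacher.shape)

# Stage 3 (Distill-on-Outputs): train the student to match the teacher's
# outputs on retain-distribution data, so only retained behavior returns.
X = rng.normal(size=(256, 8))  # stand-in for retain-set inputs
lr = 0.05
for _ in range(500):
    err = W_student @ X.T - W_teacher @ X.T  # output mismatch (4, 256)
    grad = err @ X / X.shape[0]              # MSE gradient w.r.t. W_student
    W_student -= lr * grad

# After distillation the student has recovered the teacher's behavior.
print(np.linalg.norm(W_student - W_teacher))
```

The noise stage is what buys robustness: because the student starts far from the teacher in weight space, it must relearn retained capability from outputs alone, rather than inheriting suppressed-but-present forget-set circuits.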
Source
- Link: https://arxiv.org/abs/2506.06278
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- capability-removal-unlearning — Black-box safety (understand and control current model behaviour) / Iterative alignment