Distillation Robustifies Unlearning

Bruce W. Lee, Addie Foote, Alex Infanger, Leni Shor, Harish Kamath, Jacob Goldman-Wetzler, … (+3 more) — 2025-06-06 — arXiv (NeurIPS 2025 Spotlight)

Summary

Proposes UNDO (Unlearn-Noise-Distill-on-Outputs): apply a standard unlearning method, partially noise the weights of the unlearned model, then distill the unlearned model's outputs back into that noised copy. The distillation step turns shallow capability suppression into robust removal, so the result withstands finetuning (relearning) attacks at a fraction of the compute of retraining from scratch. A minimal sketch of the pipeline follows.
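Below is a minimal PyTorch sketch of the Noise and Distill-on-Outputs stages, assuming the Unlearn stage has already been applied to the teacher. It assumes a Hugging Face-style causal LM whose forward pass returns an object with `.logits` and an iterable of token-id batches; the `undo` helper name, the `alpha` weight-mixing rule, the `0.02` noise scale, and the distillation temperature are illustrative assumptions, not the paper's exact recipe.

```python
import copy
import torch
import torch.nn.functional as F

def undo(teacher, batches, alpha=0.5, lr=1e-4, temperature=2.0):
    """Hypothetical sketch of UNDO's Noise + Distill-on-Outputs stages.

    `teacher` is assumed to be an already-unlearned causal LM; `batches`
    yields token-id tensors. The noising rule below (mixing each weight
    with fresh Gaussian noise by a factor `alpha`) is an illustrative
    choice, not the paper's exact schedule.
    """
    # Noise: start the student from a partially corrupted copy of the
    # unlearned teacher, destroying latent traces of the forget set.
    student = copy.deepcopy(teacher)
    with torch.no_grad():
        for p in student.parameters():
            # 0.02 is an arbitrary init-scale assumption for the noise.
            p.mul_(1 - alpha).add_(alpha * 0.02 * torch.randn_like(p))

    teacher.eval()
    opt = torch.optim.AdamW(student.parameters(), lr=lr)

    # Distill-on-outputs: train the student to match the teacher's
    # token distributions, transferring only the retained capabilities.
    for input_ids in batches:
        with torch.no_grad():
            t_logits = teacher(input_ids).logits
        s_logits = student(input_ids).logits
        loss = F.kl_div(
            F.log_softmax(s_logits / temperature, dim=-1),
            F.softmax(t_logits / temperature, dim=-1),
            reduction="batchmean",
        ) * temperature**2
        opt.zero_grad()
        loss.backward()
        opt.step()
    return student
```

The key design point the sketch illustrates: because the student starts partially re-initialized, any capability not expressed in the teacher's outputs never gets transferred, which is what makes the removal hard to recover by finetuning.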

Key Result

On the WMDP benchmark, UNDO matches the robustness of a model retrained from scratch with perfect data filtering, while using only 60-80% of the compute and requiring only 0.01% of the pretraining data to be labeled.

Source