Distillation Robustifies Unlearning
Bruce W. Lee, Addie Foote, Alex Infanger, Leni Shor, Harish Kamath, Jacob Goldman-Wetzler, … (+3 more) — 2025-06-06 — arXiv (NeurIPS 2025 Spotlight)
Summary
Proposes UNDO (Unlearn-Noise-Distill-on-Outputs), a method that first applies a standard unlearning procedure, then perturbs the model's weights with noise, and finally distills the unlearned teacher into this noised copy on the teacher's own outputs. The result is capability removal that is robust to finetuning attacks, matching the robustness of retraining from scratch at a fraction of the compute.
Key Result
On the WMDP benchmark, UNDO achieves robustness comparable to retraining from scratch with perfect data filtering, while requiring only 60-80% of the compute and 0.01% of the labeled pretraining data.
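The three-stage pipeline can be illustrated with a toy sketch. This is not the paper's implementation: the linear map standing in for an LLM, the noise-mixing coefficient `alpha`, and the plain MSE distillation loss are all simplifying assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy model: a linear map y = W @ x stands in for an LLM.
# Stage 1 (Unlearn): assume W_teacher is the already-unlearned model.
W_teacher = rng.normal(size=(4, 8))

# Stage 2 (Noise): mix noise into a copy of the teacher's weights,
# damaging retained skills and latent forget-set capability alike.
alpha = 0.5  # noise strength (hypothetical hyperparameter)
W_student = (1 - alpha) * W_teacher + alpha * rng.normal(size=W_teacher.shape)

# Stage 3 (Distill-on-Outputs): train the student to match the teacher's
# outputs on retain-distribution data, so only retained behavior returns.
X = rng.normal(size=(256, 8))  # stand-in for retain-set inputs
lr = 0.05
for _ in range(500):
    err = W_student @ X.T - W_teacher @ X.T  # output mismatch (4, 256)
    grad = err @ X / X.shape[0]              # MSE gradient w.r.t. W_student
    W_student -= lr * grad

# After distillation the student has recovered the teacher's behavior.
print(np.linalg.norm(W_student - W_teacher))
```

The noise stage is what buys robustness: because the student starts far from the teacher in weight space, it must relearn retained capability from outputs alone, rather than inheriting suppressed-but-present forget-set circuits.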
Source
- Link: https://arxiv.org/abs/2506.06278
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- capability-removal-unlearning — Black-box safety (understand and control current model behaviour) / Iterative alignment