Unlearning Isn’t Deletion: Investigating Reversibility of Machine Unlearning in LLMs
Xiaoyu Xu, Xiang Yue, Yang Liu, Qingqing Ye, Huadi Zheng, Peizhao Hu, … (+2 more) — 2025-05-22 — arXiv
Summary
Introduces a representation-level analysis framework for evaluating machine unlearning in LLMs. The analysis shows that standard behavioural metrics are misleading: ostensibly unlearned information can be restored through minimal fine-tuning, indicating that current unlearning methods merely suppress rather than genuinely erase it.
Key Result
Across six unlearning methods, three data domains, and two LLMs, unlearned information appears to be suppressed rather than erased: the original behaviour is easily restored through minimal fine-tuning. The analysis identifies four distinct forgetting regimes, distinguished by reversibility and catastrophicity.
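The reversibility finding can be illustrated with a toy probe. This is an assumed sketch, not the paper's code: a logistic model stands in for the LLM, "unlearning" is simulated by fine-tuning on flipped forget-set labels, and a handful of relearning steps then restores the original forget-set loss. All names, data, and step counts are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for an LLM and its forget set: logistic regression on
# synthetic, linearly separable data (purely illustrative).
X = rng.normal(size=(64, 8))
y = (X @ rng.normal(size=8) > 0).astype(float)  # "forget set" labels

def forget_loss(w, labels):
    """Cross-entropy of the model on the forget set."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    return -np.mean(labels * np.log(p + 1e-9) + (1 - labels) * np.log(1 - p + 1e-9))

def fit(w, labels, steps, lr=0.5):
    """Plain gradient descent toward `labels` (does not mutate `w`)."""
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        w = w - lr * X.T @ (p - labels) / len(labels)
    return w

w0 = fit(np.zeros(8), y, steps=200)            # original model: low forget loss
w_unlearned = fit(w0, 1 - y, steps=100)        # "unlearning": fit flipped labels
w_relearned = fit(w_unlearned, y, steps=30)    # minimal relearning: few steps

before, after, relearned = (forget_loss(w, y) for w in (w0, w_unlearned, w_relearned))
recovery = (after - relearned) / (after - before)
print(f"forget loss: before={before:.3f}  after-unlearn={after:.3f}  "
      f"after-30-step-relearn={relearned:.3f}  recovery={recovery:.1%}")
```

The probe mirrors the paper's protocol in miniature: unlearning drives the forget-set loss up, yet 30 relearning steps (versus 200 original training steps) recover most of it, the signature of suppression rather than erasure.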
Source
- Link: https://arxiv.org/abs/2505.16831
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- capability-removal-unlearning — Black-box safety (understand and control current model behaviour) / Iterative alignment