Unlearning Isn’t Deletion: Investigating Reversibility of Machine Unlearning in LLMs
Xiaoyu Xu, Xiang Yue, Yang Liu, Qingqing Ye, Huadi Zheng, Peizhao Hu, … (+2 more) — 2025-05-22 — arXiv
Summary
Introduces a representation-level analysis framework for evaluating machine unlearning in LLMs. The analysis shows that standard behavioural metrics are misleading: ostensibly unlearned information can be restored through minimal fine-tuning, indicating that current unlearning methods merely suppress rather than genuinely erase it.
Key Result
Across six unlearning methods, three data domains, and two LLMs, unlearned information appears to be suppressed rather than erased: the original behaviour is easily restored through minimal fine-tuning. The analysis identifies four distinct forgetting regimes, distinguished by reversibility and catastrophicity.
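The reversibility finding can be illustrated with a toy probe. This is an assumed sketch, not the paper's code: a logistic model stands in for the LLM, "unlearning" is simulated by fine-tuning on flipped forget-set labels, and a handful of relearning steps then restores the original forget-set loss. All names, data, and step counts are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for an LLM and its forget set: logistic regression on
# synthetic, linearly separable data (purely illustrative).
X = rng.normal(size=(64, 8))
y = (X @ rng.normal(size=8) > 0).astype(float)  # "forget set" labels

def forget_loss(w, labels):
    """Cross-entropy of the model on the forget set."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    return -np.mean(labels * np.log(p + 1e-9) + (1 - labels) * np.log(1 - p + 1e-9))

def fit(w, labels, steps, lr=0.5):
    """Plain gradient descent toward `labels` (does not mutate `w`)."""
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        w = w - lr * X.T @ (p - labels) / len(labels)
    return w

w0 = fit(np.zeros(8), y, steps=200)            # original model: low forget loss
w_unlearned = fit(w0, 1 - y, steps=100)        # "unlearning": fit flipped labels
w_relearned = fit(w_unlearned, y, steps=30)    # minimal relearning: few steps

before, after, relearned = (forget_loss(w, y) for w in (w0, w_unlearned, w_relearned))
recovery = (after - relearned) / (after - before)
print(f"forget loss: before={before:.3f}  after-unlearn={after:.3f}  "
      f"after-30-step-relearn={relearned:.3f}  recovery={recovery:.1%}")
```

The probe mirrors the paper's protocol in miniature: unlearning drives the forget-set loss up, yet 30 relearning steps (versus 200 original training steps) recover most of it, the signature of suppression rather than erasure.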
Source
- Link: https://arxiv.org/abs/2505.16831
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- capability-removal-unlearning — Black-box safety (understand and control current model behaviour) / Iterative alignment