Unlearning Needs to be More Selective [Progress Report]

Filip Sondej, Yushi Yang, Marcel Windys — 2025-06-27 — LessWrong

Summary

Presents MUDMAN, a new machine unlearning method built on Disruption Masking, which permits a weight update only where the unlearning and retain gradients align. The authors report a 40% improvement over TAR, the previous state-of-the-art method, in robustness to fine-tuning attacks.
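
The masking idea can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that "align" means the two gradients have the same sign at a given weight, and that the update is a plain gradient step on the masked unlearning gradient. The function names, the learning rate, and the toy values are hypothetical.

```python
from typing import List


def disruption_mask(unlearn_grad: List[float], retain_grad: List[float]) -> List[float]:
    """Zero out unlearning-gradient entries that disagree in sign with the retain gradient.

    Assumption: "align" is read as "same sign, elementwise".
    """
    return [u if u * r > 0 else 0.0 for u, r in zip(unlearn_grad, retain_grad)]


def masked_update(
    weights: List[float],
    unlearn_grad: List[float],
    retain_grad: List[float],
    lr: float = 0.5,  # hypothetical learning rate, chosen for exact arithmetic
) -> List[float]:
    """Apply one gradient step using only the entries that survive the mask."""
    masked = disruption_mask(unlearn_grad, retain_grad)
    return [w - lr * g for w, g in zip(weights, masked)]


# Toy example: only the first weight has gradients with matching signs.
weights = [1.0, 1.0, 1.0]
unlearn_grad = [0.5, -0.25, 0.5]
retain_grad = [0.25, 0.5, -1.0]

print(disruption_mask(unlearn_grad, retain_grad))         # [0.5, 0.0, 0.0]
print(masked_update(weights, unlearn_grad, retain_grad))  # [0.75, 1.0, 1.0]
```

In this toy case the second and third weights stay fixed because changing them in the unlearning direction would work against the retain objective, which is the selectivity the title refers to.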

Key Result

MUDMAN outperforms TAR, the current state-of-the-art unlearning method, by 40% in robustness to fine-tuning attacks. It does so by combining selectivity (Disruption Masking), MAML meta-learning, and careful gradient-based updates.

Source

Link: https://lesswrong.com/posts/QYzofMbzmbgiwfqy8/unlearning-needs-to-be-more-selective-progress-report
Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- capability-removal-unlearning — Black-box safety (understand and control current model behaviour) / Iterative alignment
Editorial blurb (verbatim): [Unlearning Needs to be More Selective \[Progress Report\]](https://lesswrong.com/posts/QYzofMbzmbgiwfqy8/unlearning-needs-to-be-more-selective-progress-report)
