Collapse of Irrelevant Representations (CIR) Ensures Robust and Non-Disruptive LLM Unlearning

Filip Sondej, Yushi Yang — 2025-09-15 — arXiv

Summary

Proposes Collapse of Irrelevant Representations (CIR), a technique that applies PCA to activations and gradients to identify subspaces of common representations and collapses them, enabling robust unlearning of dangerous knowledge while preserving general model performance.
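
To make the "identify and collapse" idea concrete, here is a minimal NumPy sketch on toy data. It is not the authors' implementation. The helper names (`top_principal_directions`, `collapse`), the mean-centering before PCA, the choice of `k`, and the synthetic activations are all assumptions made for illustration. The same projection could in principle be applied to gradient vectors.

```python
import numpy as np


def top_principal_directions(samples: np.ndarray, k: int) -> np.ndarray:
    """Return the top-k principal directions of `samples` as rows, shape (k, d).

    Assumption: samples are mean-centered before PCA.
    """
    centered = samples - samples.mean(axis=0, keepdims=True)
    # Rows of vt are orthonormal principal directions, sorted by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k]


def collapse(vectors: np.ndarray, directions: np.ndarray) -> np.ndarray:
    """Project out the subspace spanned by the orthonormal rows of `directions`."""
    return vectors - (vectors @ directions.T) @ directions


# Toy data (hypothetical): one dominant "common" direction plus small noise.
rng = np.random.default_rng(0)
dim = 16
common = np.zeros(dim)
common[0] = 1.0
coeffs = rng.normal(scale=10.0, size=(200, 1))
acts = coeffs * common + rng.normal(scale=0.1, size=(200, dim))

dirs = top_principal_directions(acts, k=1)
collapsed = collapse(acts, dirs)

# After collapsing, nothing is left along the identified common direction.
residual = np.abs(collapsed @ dirs.T).max()
alignment = abs(float(dirs[0] @ common))
```

In this sketch, `residual` measures what remains along the collapsed direction and `alignment` measures how closely the recovered direction matches the planted one. What remains in `collapsed` is the part of each vector outside the common subspace, which is the part an unlearning update would then act on under this reading of the summary.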

Key Result

Achieves an over 30x greater reduction in post-attack accuracy than the Circuit Breakers baseline when unlearning bio- and cyber-hazardous facts from Llama-3.1-8B, while disrupting general performance 30x less.

Source