Collapse of Irrelevant Representations (CIR) Ensures Robust and Non-Disruptive LLM Unlearning
Filip Sondej, Yushi Yang — 2025-09-15 — arXiv
Summary
Proposes Collapse of Irrelevant Representations (CIR), a technique using PCA on activations and gradients to identify and collapse common representation subspaces, enabling robust unlearning of dangerous knowledge while preserving general model performance.
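The core mechanism can be illustrated with a minimal sketch of the general idea: use PCA to find the principal directions of a set of representations, then "collapse" other vectors by projecting out those directions. This is a simplified illustration of the PCA-and-project-out pattern, not the paper's exact CIR procedure; the function names and shapes are assumptions for this example.

```python
import numpy as np

def principal_subspace(acts: np.ndarray, k: int) -> np.ndarray:
    """Top-k principal directions of a (samples x dim) activation matrix."""
    centered = acts - acts.mean(axis=0, keepdims=True)
    # SVD of the centered matrix; rows of vt are orthonormal principal directions.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k]  # shape (k, dim)

def collapse(vectors: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """Remove each vector's component lying in span(basis).

    basis rows are assumed orthonormal (guaranteed by the SVD above),
    so projection is a pair of matrix multiplies.
    """
    return vectors - (vectors @ basis.T) @ basis
```

In a CIR-like pipeline, a subspace identified as irrelevant to the unlearning target would be collapsed in this manner so that updates concentrate on the representations actually responsible for the hazardous knowledge.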
Key Result
Achieves over a 30x greater reduction in post-attack accuracy than the Circuit Breakers baseline when unlearning bio- and cyber-hazardous facts from Llama-3.1-8B, while disrupting general performance 30x less.
Source
- Link: https://arxiv.org/abs/2509.11816
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- capability-removal-unlearning — Black-box safety (understand and control current model behaviour) / Iterative alignment