Collapse of Irrelevant Representations (CIR) Ensures Robust and Non-Disruptive LLM Unlearning
Filip Sondej, Yushi Yang — 2025-09-15 — arXiv
Summary
Proposes Collapse of Irrelevant Representations (CIR), a technique using PCA on activations and gradients to identify and collapse common representation subspaces, enabling robust unlearning of dangerous knowledge while preserving general model performance.
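The core mechanism can be illustrated with a minimal sketch of the general idea: use PCA to find the principal directions of a set of representations, then "collapse" other vectors by projecting out those directions. This is a simplified illustration of the PCA-and-project-out pattern, not the paper's exact CIR procedure; the function names and shapes are assumptions for this example.

```python
import numpy as np

def principal_subspace(acts: np.ndarray, k: int) -> np.ndarray:
    """Top-k principal directions of a (samples x dim) activation matrix."""
    centered = acts - acts.mean(axis=0, keepdims=True)
    # SVD of the centered matrix; rows of vt are orthonormal principal directions.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k]  # shape (k, dim)

def collapse(vectors: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """Remove each vector's component lying in span(basis).

    basis rows are assumed orthonormal (guaranteed by the SVD above),
    so projection is a pair of matrix multiplies.
    """
    return vectors - (vectors @ basis.T) @ basis
```

In a CIR-like pipeline, a subspace identified as irrelevant to the unlearning target would be collapsed in this manner so that updates concentrate on the representations actually responsible for the hazardous knowledge.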
Key Result
Achieves over a 30x greater reduction in post-attack accuracy than the Circuit Breakers baseline when unlearning bio- and cyber-hazardous facts from Llama-3.1-8B, while disrupting general performance 30x less.
Source
- Link: https://arxiv.org/abs/2509.11816
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- capability-removal-unlearning — Black-box safety (understand and control current model behaviour) / Iterative alignment