CRISP: Persistent Concept Unlearning via Sparse Autoencoders

Tomer Ashuach, Dana Arad, Aaron Mueller, Martin Tutek, Yonatan Belinkov — 2025-08-19 — arXiv

Summary

Introduces CRISP, a parameter-efficient method for persistent concept unlearning. CRISP uses sparse autoencoders (SAEs) to identify salient features across multiple layers and suppress them, producing lasting parameter changes rather than inference-time interventions.
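The idea of identifying salient sparse-autoencoder features and then penalizing them can be illustrated with a small sketch. This is not the authors' implementation: the selection rule (activation frequency on a target corpus compared with a benign corpus), the thresholds `min_target_freq` and `min_ratio`, and all function names are assumptions made for illustration. It operates on activations for a single layer, whereas the summary describes working across multiple layers, and it only computes a penalty value. In a real setup that penalty would presumably be minimized by updating a small set of model parameters.

```python
from typing import List, Sequence


def activation_frequency(acts: Sequence[Sequence[float]]) -> List[float]:
    """Fraction of tokens on which each feature is active (activation > 0)."""
    n_tokens = len(acts)
    n_features = len(acts[0])
    return [
        sum(1 for row in acts if row[j] > 0.0) / n_tokens
        for j in range(n_features)
    ]


def select_salient_features(
    target_acts: Sequence[Sequence[float]],
    benign_acts: Sequence[Sequence[float]],
    min_target_freq: float = 0.5,  # hypothetical threshold
    min_ratio: float = 3.0,        # hypothetical threshold
) -> List[int]:
    """Pick features that fire often on target text but rarely on benign text."""
    target_freq = activation_frequency(target_acts)
    benign_freq = activation_frequency(benign_acts)
    smoothing = 1e-6  # avoids division by zero for features silent on benign text
    return [
        j
        for j in range(len(target_freq))
        if target_freq[j] >= min_target_freq
        and target_freq[j] / (benign_freq[j] + smoothing) >= min_ratio
    ]


def suppression_penalty(
    acts: Sequence[Sequence[float]], salient: Sequence[int]
) -> float:
    """Mean per-token sum of squared activations over the selected features."""
    return sum(row[j] ** 2 for row in acts for j in salient) / len(acts)
```

Each row of `target_acts` and `benign_acts` is assumed to hold one token's SAE feature activations. Requiring both a minimum target frequency and a minimum target-to-benign ratio is one simple way to avoid selecting features that are merely common everywhere, which matters if benign capabilities are to be preserved.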

Key Result

On the WMDP benchmark, CRISP removes harmful knowledge while preserving general and in-domain capabilities. It outperforms prior unlearning approaches and achieves semantically coherent separation between target and benign concepts.

Source