CRISP: Persistent Concept Unlearning via Sparse Autoencoders
Tomer Ashuach, Dana Arad, Aaron Mueller, Martin Tutek, Yonatan Belinkov — 2025-08-19 — arXiv
Summary
Introduces CRISP, a parameter-efficient method for persistent concept unlearning that uses sparse autoencoders to identify and suppress salient features across multiple layers, creating lasting parameter changes rather than inference-time interventions.
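To make "identify and suppress salient features" concrete, here is a minimal sketch of one plausible selection rule. It assumes that SAE feature activations have already been collected for a target corpus and a benign corpus at a single layer. The function names, the ratio threshold, and the penalty term are all hypothetical illustrations, not the paper's actual criterion or objective.

```python
# Illustrative only: a toy stand-in for selecting "salient" SAE features.
# Assumption: a feature is salient if its mean activation on target-concept
# text is much larger than on benign text. The threshold is hypothetical.

def mean_per_feature(acts):
    """Average each feature's activation over all rows (tokens)."""
    n_features = len(acts[0])
    return [sum(row[j] for row in acts) / len(acts) for j in range(n_features)]


def select_salient_features(target_acts, benign_acts, ratio_threshold=3.0, eps=1e-8):
    """Return indices of features that fire far more on target than benign data."""
    t_mean = mean_per_feature(target_acts)
    b_mean = mean_per_feature(benign_acts)
    return [
        j
        for j in range(len(t_mean))
        if t_mean[j] > 0 and t_mean[j] / (b_mean[j] + eps) >= ratio_threshold
    ]


def suppression_penalty(acts, salient):
    """Mean summed activation of the salient features over all rows.

    Hypothetical training signal: a persistent method would update model
    parameters so that this quantity shrinks on target data, instead of
    zeroing the features at inference time.
    """
    return sum(sum(row[j] for j in salient) for row in acts) / len(acts)


# Toy activations: 2 tokens x 3 SAE features per corpus.
target = [[0.0, 2.0, 1.0], [0.0, 4.0, 1.0]]
benign = [[1.0, 0.0, 1.0], [3.0, 0.5, 1.0]]

salient = select_salient_features(target, benign)
penalty = suppression_penalty(target, salient)
```

In this toy data only feature 1 is selected: feature 0 never fires on the target corpus, and feature 2 fires equally on both. A multi-layer variant would repeat the selection per layer with that layer's SAE.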
Key Result
CRISP removes harmful knowledge, as measured on the WMDP benchmark, while preserving general and in-domain capabilities. It outperforms prior unlearning approaches and yields a semantically coherent separation between target and benign concepts.
Source
- Link: https://arxiv.org/abs/2508.13650
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- sparse-coding — White-box safety (i.e. Interpretability)