CRISP: Persistent Concept Unlearning via Sparse Autoencoders

Tomer Ashuach, Dana Arad, Aaron Mueller, Martin Tutek, Yonatan Belinkov — 2025-08-19 — arXiv

Summary

Introduces CRISP, a parameter-efficient method for persistent concept unlearning that uses sparse autoencoders to identify and suppress salient features across multiple layers, creating lasting parameter changes rather than inference-time interventions.
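
To make the "identify and suppress salient features" idea concrete, here is a minimal toy sketch. It is not the paper's implementation: the scoring rule (mean activation on target text minus mean activation on benign text), the squared-activation penalty, and all function names are assumptions chosen for illustration. A real setup would read activations from trained sparse autoencoders at several layers and minimize such a penalty while fine-tuning the model's parameters.

```python
from typing import List, Sequence

Acts = Sequence[Sequence[float]]  # rows = tokens, columns = SAE features


def mean_activation(acts: Acts) -> List[float]:
    """Average each SAE feature's activation over all tokens."""
    n_tokens = len(acts)
    n_features = len(acts[0])
    return [sum(row[j] for row in acts) / n_tokens for j in range(n_features)]


def select_salient_features(target_acts: Acts, benign_acts: Acts, top_k: int) -> List[int]:
    """Pick up to top_k features that fire more on target than on benign text.

    Assumed criterion for illustration only: difference of mean activations.
    """
    t = mean_activation(target_acts)
    b = mean_activation(benign_acts)
    scores = [ti - bi for ti, bi in zip(t, b)]
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return [j for j in ranked[:top_k] if scores[j] > 0]


def suppression_penalty(acts: Acts, salient: Sequence[int]) -> float:
    """Mean squared activation of the salient features (hypothetical loss term)."""
    return sum(row[j] ** 2 for row in acts for j in salient) / len(acts)


# Toy activations: feature 1 fires on target text, feature 0 on benign text,
# and feature 2 fires equally on both.
target_acts = [[0.0, 2.0, 0.5], [0.0, 3.0, 0.5]]
benign_acts = [[1.0, 0.0, 0.5], [1.0, 0.2, 0.5]]

salient = select_salient_features(target_acts, benign_acts, top_k=2)
penalty = suppression_penalty(target_acts, salient)
print(salient, penalty)
```

In this toy data only feature 1 is selected, because feature 2 is equally active on both corpora and so carries no target-specific signal. Driving the penalty toward zero through parameter updates, rather than clamping features at inference time, is what would make the change persistent.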

Key Result

On safety-critical unlearning tasks from the WMDP benchmark, CRISP removes harmful knowledge while preserving general and in-domain capabilities, outperforming prior unlearning approaches. It also achieves a semantically coherent separation between target and benign concepts.

Source

Link: https://arxiv.org/abs/2508.13650
Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- sparse-coding — White-box safety (i.e. Interpretability)
