Safety Alignment via Constrained Knowledge Unlearning
Zesheng Shi, Yucheng Zhou, Jing Li — 2025-05-24 — arXiv
Summary
Proposes Constrained Knowledge Unlearning (CKU), a safety alignment method that first identifies neurons encoding useful knowledge and then, during unlearning, selectively prunes the gradients that would modify them, removing harmful knowledge from LLMs without compromising overall performance.
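The mechanism (locate useful-knowledge neurons, then prune gradients that touch them during unlearning) can be illustrated with a minimal PyTorch sketch. The gradient-magnitude importance score, the quantile threshold `tau`, and all function names below are illustrative assumptions, not the paper's exact procedure:

```python
import torch

def neuron_importance(model, utility_batch, loss_fn):
    """Score each parameter coordinate's importance for useful knowledge
    via gradient magnitude on a utility (retain) batch.
    Assumption: the paper's actual scoring criterion may differ."""
    model.zero_grad()
    loss = loss_fn(model(utility_batch["input"]), utility_batch["target"])
    loss.backward()
    return {name: p.grad.detach().abs() for name, p in model.named_parameters()}

def constrained_unlearning_step(model, harmful_batch, loss_fn,
                                importance, optimizer, tau=0.9):
    """One constrained unlearning step: ascend the loss on harmful data,
    but zero ('prune') gradients at coordinates whose utility importance
    exceeds the tau-quantile, preserving useful-knowledge neurons."""
    model.zero_grad()
    loss = loss_fn(model(harmful_batch["input"]), harmful_batch["target"])
    (-loss).backward()  # gradient ascent on harmful-data loss = unlearning
    for name, p in model.named_parameters():
        if p.grad is None:
            continue
        # threshold on the importance distribution of this parameter tensor
        thresh = torch.quantile(importance[name].flatten().float(), tau)
        p.grad[importance[name] > thresh] = 0.0  # prune protected coordinates
    optimizer.step()
```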
Key Result
CKU significantly enhances model safety against jailbreak attacks while maintaining overall performance, offering a superior balance between safety and utility compared to existing methods.
Source
- Link: https://arxiv.org/abs/2505.18588
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
  - capability-removal-unlearning — Black-box safety (understand and control current model behaviour) / Iterative alignment