Keep Calm and Avoid Harmful Content: Concept Alignment and Latent Manipulation Towards Safer Answers

Ruben Belo, Marta Guimaraes, Claudia Soares — 2025-10-14 — arXiv

Summary

Proposes CALM, an inference-time method that suppresses harmful concepts in large language models by modifying their latent representations with concept whitening and orthogonal projection. The method requires no retraining and no additional training data.
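
The two operations named in the summary can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it applies plain ZCA whitening (without the concept-alignment step that concept whitening adds on top of whitening), uses synthetic activations, and treats a random vector as a stand-in for a harmful-concept direction. The function names `zca_whitening` and `project_out` are hypothetical.

```python
import numpy as np


def zca_whitening(acts, eps=1e-5):
    """Return (mean, W) so that (acts - mean) @ W has ~identity covariance."""
    mean = acts.mean(axis=0)
    centered = acts - mean
    cov = centered.T @ centered / acts.shape[0]
    eigvals, eigvecs = np.linalg.eigh(cov)  # cov is symmetric
    W = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return mean, W


def project_out(h, directions):
    """Remove from each row of h its component in span(directions).

    directions has shape (k, d); its rows need not be orthonormal.
    """
    Q, _ = np.linalg.qr(directions.T)  # (d, k) orthonormal basis
    return h - (h @ Q) @ Q.T


rng = np.random.default_rng(0)
d = 16

# Synthetic "latent activations": correlated, unequal variance per axis.
rotation, _ = np.linalg.qr(rng.normal(size=(d, d)))
acts = (rng.normal(size=(512, d)) * np.linspace(0.5, 3.0, d)) @ rotation

mean, W = zca_whitening(acts)
whitened = (acts - mean) @ W

# Assumption: a random vector stands in for a learned harmful-concept axis.
harmful_dir = rng.normal(size=(1, d))
cleaned = project_out(whitened, harmful_dir)
```

After the projection, every row of `cleaned` is orthogonal to `harmful_dir`, so that direction carries no signal. Both steps are fixed linear maps applied to activations, consistent with the summary's description of an inference-time change that needs no retraining.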

Key Result

CALM reduces harmful outputs and outperforms baseline methods on most metrics, while adding only a small computational overhead at inference time.

Source