Keep Calm and Avoid Harmful Content: Concept Alignment and Latent Manipulation Towards Safer Answers
Ruben Belo, Marta Guimaraes, Claudia Soares — 2025-10-14 — arXiv
Summary
Proposes CALM, an inference-time method that suppresses harmful concepts in large language models by modifying latent representations using concept whitening and orthogonal projection, without requiring retraining or additional training data.
Key Result
CALM reduces harmful outputs and outperforms baseline methods on most metrics while incurring only a small computational overhead at inference time.
Source
- Link: https://arxiv.org/abs/2510.12672
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- activation-engineering — White-box safety (i.e. Interpretability) / Concept-based interpretability