Keep Calm and Avoid Harmful Content: Concept Alignment and Latent Manipulation Towards Safer Answers
Ruben Belo, Marta Guimaraes, Claudia Soares — 2025-10-14 — arXiv
Summary
Proposes CALM, an inference-time method that suppresses harmful concepts in large language models by modifying latent representations using concept whitening and orthogonal projection, without requiring retraining or additional training data.
Key Result
CALM reduces harmful outputs and outperforms baseline methods on most metrics while incurring only a small computational overhead at inference time.
Source
- Link: https://arxiv.org/abs/2510.12672
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- activation-engineering — White-box safety (i.e. Interpretability) / Concept-based interpretability