The Unintended Trade-off of AI Alignment: Balancing Hallucination Mitigation and Safety in LLMs
Omar Mahmoud, Ali Khalil, Buddhika Laknath Semage, Thommen George Karimpanal, Santu Rana — 2025-10-09 — arXiv
Summary
Empirically demonstrates that improving truthfulness in LLMs weakens safety alignment (refusal behavior) because the two behaviors rely on overlapping internal components, and proposes a method using sparse autoencoders (SAEs) and subspace orthogonalization to disentangle refusal features from truthfulness updates and preserve them during fine-tuning.
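A minimal sketch of the subspace-orthogonalization idea, assuming the refusal directions have already been extracted (e.g., as SAE decoder rows associated with refusal): each fine-tuning update is projected onto the orthogonal complement of the refusal subspace, so truthfulness training cannot move the model along those directions. The function names, tensor shapes, and random stand-in directions below are illustrative assumptions, not the paper's exact procedure.

```python
import torch

def project_out_subspace(update: torch.Tensor, dirs: torch.Tensor) -> torch.Tensor:
    """Remove the component of `update` that lies in span(dirs).

    update: (d,) fine-tuning update for one parameter vector.
    dirs:   (k, d) hypothetical refusal-feature directions, e.g. SAE
            decoder rows whose ablation changes refusal behavior.
    """
    # Orthonormalize the directions so the projection is well defined.
    Q, _ = torch.linalg.qr(dirs.T)       # Q: (d, k), orthonormal basis columns
    return update - Q @ (Q.T @ update)   # project onto the orthogonal complement

# Toy usage: keep a 2-D "refusal subspace" of a 768-D space untouched.
torch.manual_seed(0)
d = 768
refusal_dirs = torch.randn(2, d)         # stand-in for SAE-derived features
grad = torch.randn(d)                    # raw truthfulness-training update
safe_grad = project_out_subspace(grad, refusal_dirs)
# The constrained update has no remaining component along the refusal directions.
assert torch.allclose(refusal_dirs @ safe_grad, torch.zeros(2), atol=1e-3)
```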
Key Result
Increasing factual accuracy comes at the cost of weakened refusal behavior; the proposed SAE-based disentanglement method reduces hallucinations while preserving refusal behavior, as evaluated on the AdvBench and StrongREJECT safety benchmarks.
Source
- Link: https://arxiv.org/abs/2510.07775
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- sparse-coding — White-box safety (i.e. Interpretability)