The Unintended Trade-off of AI Alignment: Balancing Hallucination Mitigation and Safety in LLMs

Omar Mahmoud, Ali Khalil, Buddhika Laknath Semage, Thommen George Karimpanal, Santu Rana — 2025-10-09 — arXiv

Summary

Empirically demonstrates that fine-tuning LLMs for truthfulness weakens safety alignment (refusal behavior) because the two capabilities rely on overlapping internal model components, and proposes a method that uses sparse autoencoders (SAEs) and subspace orthogonalization to disentangle refusal features from truthfulness features and preserve them during fine-tuning.
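
The core idea of subspace orthogonalization can be illustrated with a short sketch: project each fine-tuning update onto the orthogonal complement of a refusal subspace so the update cannot erase refusal features. This is a minimal sketch assuming the refusal subspace is spanned by SAE-identified directions; the function name, tensor shapes, and use of PyTorch are illustrative assumptions, not the paper's implementation.

```python
import torch

def orthogonalize_update(update: torch.Tensor,
                         refusal_dirs: torch.Tensor) -> torch.Tensor:
    """Project a parameter update onto the orthogonal complement of the
    refusal subspace, so fine-tuning leaves refusal features intact.

    update:       (d,) flattened parameter update
    refusal_dirs: (k, d) directions spanning the refusal subspace
                  (hypothetically, SAE decoder rows tagged as refusal features)
    """
    # Orthonormalize the refusal directions via reduced QR on the transpose.
    Q, _ = torch.linalg.qr(refusal_dirs.T)   # (d, k) with orthonormal columns
    # Subtract the component of the update that lies in the refusal subspace.
    return update - Q @ (Q.T @ update)

# Toy usage: the projected update has ~zero overlap with the refusal subspace.
d, k = 4096, 8
update = torch.randn(d)
refusal_dirs = torch.randn(k, d)
safe_update = orthogonalize_update(update, refusal_dirs)
print(torch.linalg.norm(refusal_dirs @ safe_update))  # ~0
```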

Key Result

Increasing factual accuracy comes at the cost of weakened refusal behavior; the proposed SAE-based disentanglement method preserves refusal behavior on the AdvBench and StrongReject safety benchmarks while still reducing hallucinations.

Source