The Unintended Trade-off of AI Alignment: Balancing Hallucination Mitigation and Safety in LLMs
Omar Mahmoud, Ali Khalil, Buddhika Laknath Semage, Thommen George Karimpanal, Santu Rana — 2025-10-09 — arXiv
Summary
Empirically demonstrates that improving truthfulness in LLMs weakens safety alignment (refusal behavior) because the two behaviors rely on overlapping internal components, and proposes a method using sparse autoencoders (SAEs) and subspace orthogonalization to disentangle refusal features from truthfulness updates and preserve them during fine-tuning.
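A minimal sketch of the subspace-orthogonalization idea, assuming the refusal directions have already been extracted (e.g., as SAE decoder rows associated with refusal): each fine-tuning update is projected onto the orthogonal complement of the refusal subspace, so truthfulness training cannot move the model along those directions. The function names, tensor shapes, and random stand-in directions below are illustrative assumptions, not the paper's exact procedure.

```python
import torch

def project_out_subspace(update: torch.Tensor, dirs: torch.Tensor) -> torch.Tensor:
    """Remove the component of `update` that lies in span(dirs).

    update: (d,) fine-tuning update for one parameter vector.
    dirs:   (k, d) hypothetical refusal-feature directions, e.g. SAE
            decoder rows whose ablation changes refusal behavior.
    """
    # Orthonormalize the directions so the projection is well defined.
    Q, _ = torch.linalg.qr(dirs.T)       # Q: (d, k), orthonormal basis columns
    return update - Q @ (Q.T @ update)   # project onto the orthogonal complement

# Toy usage: keep a 2-D "refusal subspace" of a 768-D space untouched.
torch.manual_seed(0)
d = 768
refusal_dirs = torch.randn(2, d)         # stand-in for SAE-derived features
grad = torch.randn(d)                    # raw truthfulness-training update
safe_grad = project_out_subspace(grad, refusal_dirs)
# The constrained update has no remaining component along the refusal directions.
assert torch.allclose(refusal_dirs @ safe_grad, torch.zeros(2), atol=1e-3)
```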
Key Result
Increasing factual accuracy comes at the cost of weakened refusal behavior; the proposed SAE-based disentanglement method reduces hallucinations while preserving refusal behavior, as evaluated on the AdvBench and StrongREJECT safety benchmarks.
Source
- Link: https://arxiv.org/abs/2510.07775
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- sparse-coding — White-box safety (i.e. Interpretability)