Robust LLM Alignment via Distributionally Robust Direct Preference Optimization

Zaiyan Xu, Sushil Vemuri, Kishan Panaganti, Dileep Kalathil, Rahul Jain, Deepak Ramachandran — 2025-02-04 — arXiv (accepted to NeurIPS 2025)

Summary

Develops two distributionally robust DPO algorithms, Wasserstein DPO (WDPO) and Kullback-Leibler DPO (KLDPO), to guard against catastrophic alignment failures caused by preference distribution shift, with sample-complexity guarantees and scalable gradient-based implementations.

Key Result

WDPO and KLDPO substantially improve alignment robustness over standard DPO when user preferences shift across demographics, regions, and cultural contexts.
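The robust objective can be sketched as a reweighting of the per-example DPO loss toward worst-case preference distributions. The snippet below is an illustrative reconstruction, not the authors' implementation: it pairs the standard DPO loss with the dual form of a KL-constrained worst-case expectation, and the fixed dual temperature `tau` and radius `rho` are assumptions chosen for illustration.

```python
import numpy as np

def dpo_losses(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Per-example DPO loss: -log sigmoid(beta * margin), where the margin is
    the policy-vs-reference log-ratio of the chosen (w) over rejected (l) response."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # logaddexp(0, -m) = log(1 + exp(-m)) = -log sigmoid(m), computed stably
    return np.logaddexp(0.0, -margin)

def kl_robust_objective(losses, tau=1.0, rho=0.1):
    """Upper bound on the worst-case expected loss over distributions within a
    KL ball of radius rho around the empirical preference distribution, via the
    dual form with a fixed (not optimized) temperature tau:
        sup_{KL(Q||P) <= rho} E_Q[loss] <= tau * log E_P[exp(loss / tau)] + tau * rho
    Hard examples are exponentially upweighted, which is the robustness mechanism."""
    log_mean_exp = np.log(np.mean(np.exp(losses / tau)))
    return tau * log_mean_exp + tau * rho
```

By Jensen's inequality the robust objective is never below the average DPO loss, so minimizing it hedges against subpopulations whose preferences are underrepresented in the training mixture.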

Source