Robust LLM Alignment via Distributionally Robust Direct Preference Optimization
Zaiyan Xu, Sushil Vemuri, Kishan Panaganti, Dileep Kalathil, Rahul Jain, Deepak Ramachandran — 2025-02-04 — arXiv (accepted to NeurIPS 2025)
Summary
Develops two novel distributionally robust DPO algorithms (WDPO and KLDPO) to address catastrophic alignment failures from preference distribution shift, with theoretical sample complexity analysis and scalable gradient-based implementations.
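The precise WDPO and KLDPO objectives and their sample-complexity analysis are given in the paper; purely as an illustration of the general idea of making a DPO-style loss robust to preference distribution shift, the sketch below pairs the standard per-example DPO loss with a generic KL-DRO dual surrogate (exponential tilting). The function names, signatures, and the fixed temperature `tau` are assumptions for demonstration, not the paper's implementation.

```python
# Illustrative sketch only: a generic KL-DRO reweighting of the standard DPO loss.
# This is NOT the paper's WDPO/KLDPO algorithm; names and the fixed temperature
# `tau` are hypothetical choices for demonstration.
import math
import torch
import torch.nn.functional as F

def dpo_per_example_loss(logratio_chosen, logratio_rejected, beta=0.1):
    """Standard DPO loss per preference pair.
    logratio_* = log pi_theta(y|x) - log pi_ref(y|x) for the chosen / rejected response."""
    margin = beta * (logratio_chosen - logratio_rejected)
    return -F.logsigmoid(margin)

def kl_dro_surrogate(per_example_losses, tau=1.0):
    """Dual-form surrogate for a worst-case expected loss over a KL ball around the
    empirical preference distribution: tau * log E[exp(loss / tau)].
    Harder preference pairs (larger losses) receive exponentially larger weight."""
    n = per_example_losses.numel()
    log_mean_exp = torch.logsumexp(per_example_losses / tau, dim=0) - math.log(n)
    return tau * log_mean_exp

# Toy usage with random log-ratios standing in for model outputs.
torch.manual_seed(0)
chosen = torch.randn(8)
rejected = torch.randn(8)
losses = dpo_per_example_loss(chosen, rejected)
robust_loss = kl_dro_surrogate(losses, tau=0.5)
print(float(losses.mean()), float(robust_loss))
```

By Jensen's inequality the surrogate upper-bounds the average DPO loss, so minimizing it concentrates training signal on the worst-case preference pairs, which is the intuition behind distributionally robust variants of DPO.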
Key Result
Compared to standard DPO, WDPO and KLDPO substantially improve alignment robustness when user preferences shift across demographics, regions, and cultural contexts.
Source
- Link: https://arxiv.org/abs/2502.01930
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
  - iterative-alignment-at-post-train-time — Black-box safety (understand and control current model behaviour) / Iterative alignment