Robust LLM Alignment via Distributionally Robust Direct Preference Optimization
Zaiyan Xu, Sushil Vemuri, Kishan Panaganti, Dileep Kalathil, Rahul Jain, Deepak Ramachandran — 2025-02-04 — arXiv (accepted to NeurIPS 2025)
Summary
Develops two novel distributionally robust DPO algorithms (WDPO and KLDPO) to address catastrophic alignment failures from preference distribution shift, with theoretical sample complexity analysis and scalable gradient-based implementations.
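The precise WDPO and KLDPO objectives and their sample-complexity analysis are given in the paper; purely as an illustration of the general idea of making a DPO-style loss robust to preference distribution shift, the sketch below pairs the standard per-example DPO loss with a generic KL-DRO dual surrogate (exponential tilting). The function names, signatures, and the fixed temperature `tau` are assumptions for demonstration, not the paper's implementation.

```python
# Illustrative sketch only: a generic KL-DRO reweighting of the standard DPO loss.
# This is NOT the paper's WDPO/KLDPO algorithm; names and the fixed temperature
# `tau` are hypothetical choices for demonstration.
import math
import torch
import torch.nn.functional as F

def dpo_per_example_loss(logratio_chosen, logratio_rejected, beta=0.1):
    """Standard DPO loss per preference pair.
    logratio_* = log pi_theta(y|x) - log pi_ref(y|x) for the chosen / rejected response."""
    margin = beta * (logratio_chosen - logratio_rejected)
    return -F.logsigmoid(margin)

def kl_dro_surrogate(per_example_losses, tau=1.0):
    """Dual-form surrogate for a worst-case expected loss over a KL ball around the
    empirical preference distribution: tau * log E[exp(loss / tau)].
    Harder preference pairs (larger losses) receive exponentially larger weight."""
    n = per_example_losses.numel()
    log_mean_exp = torch.logsumexp(per_example_losses / tau, dim=0) - math.log(n)
    return tau * log_mean_exp

# Toy usage with random log-ratios standing in for model outputs.
torch.manual_seed(0)
chosen = torch.randn(8)
rejected = torch.randn(8)
losses = dpo_per_example_loss(chosen, rejected)
robust_loss = kl_dro_surrogate(losses, tau=0.5)
print(float(losses.mean()), float(robust_loss))
```

By Jensen's inequality the surrogate upper-bounds the average DPO loss, so minimizing it concentrates training signal on the worst-case preference pairs, which is the intuition behind distributionally robust variants of DPO.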
Key Result
Compared to standard DPO, WDPO and KLDPO substantially improve alignment robustness when user preferences shift across demographics, regions, and cultural contexts.
Source
- Link: https://arxiv.org/abs/2502.01930
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
  - iterative-alignment-at-post-train-time — Black-box safety (understand and control current model behaviour) / Iterative alignment