Multiplayer Nash Preference Optimization

Fang Wu, Xu Huang, Weihao Xuan, Zhiwei Zhang, Yijia Xiao, Guancheng Wan, … (+5 more) — 2025-09-27 — Stanford University, University of Washington, AI2 — arXiv

Summary

Extends Nash learning from human feedback (NLHF) from the usual two-player game to the multiplayer setting: alignment is formulated as an n-player game in which each policy competes against a population of opponents while being regularized toward a reference model, and MNPO seeks the resulting multiplayer Nash equilibrium.
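
In one hedged formalization (the notation below is inferred from this summary, not taken from the paper), each player maximizes its expected preference win-rate against the mixture of the other players' policies, minus a KL anchor to the reference model:

```latex
% Sketch of a per-player objective consistent with the summary above.
% \mathcal{P}, \bar{\pi}_{-i}, and \tau are assumed notation for the pairwise
% preference probability, the opponent mixture, and the regularization strength.
\[
  \pi_i^\star \in \arg\max_{\pi_i}\;
    \mathbb{E}_{y \sim \pi_i(\cdot \mid x),\; y' \sim \bar{\pi}_{-i}(\cdot \mid x)}
      \bigl[ \mathcal{P}(y \succ y' \mid x) \bigr]
    \;-\; \tau\, \mathrm{KL}\bigl( \pi_i \,\|\, \pi_{\mathrm{ref}} \bigr)
\]
```

A multiplayer Nash equilibrium is then a profile in which no player can improve this objective unilaterally; with n = 2 and a single opponent this reduces to the standard two-player NLHF game.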

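To make the population-play idea concrete, here is a minimal toy sketch on a discrete response set: a synthetic preference matrix stands in for the annotator, each player's opponent is the mixture of the other players, and the update is a KL-regularized mirror-descent step of the kind used in two-player NLHF analyses. Every name and number below (K, n, tau, eta, P, the update rule) is an illustrative assumption, not the authors' implementation.

```python
import numpy as np

# Toy illustration of population play on a discrete response set.
rng = np.random.default_rng(0)
K = 5                    # candidate responses
n = 3                    # players in the population
tau, eta, steps = 0.1, 0.5, 200

# P[a, b] = probability that response a is preferred over response b.
P = rng.uniform(size=(K, K))
P = P / (P + P.T)        # enforce P[a, b] + P[b, a] = 1

pi_ref = np.full(K, 1.0 / K)                   # uniform reference policy
pis = [np.full(K, 1.0 / K) for _ in range(n)]  # one policy per player

for _ in range(steps):
    new_pis = []
    for i, pi in enumerate(pis):
        # The mixture of the other n-1 players acts as the opponent population.
        opp = np.mean([pis[j] for j in range(n) if j != i], axis=0)
        win = P @ opp    # expected win-rate of each response vs. that mixture
        # KL-regularized mirror-descent step (borrowed from two-player NLHF
        # analyses, an assumption here): the new policy is proportional to
        # pi^(1 - eta*tau) * pi_ref^(eta*tau) * exp(eta * win).
        logits = (1 - eta * tau) * np.log(pi) + eta * tau * np.log(pi_ref) + eta * win
        new_pi = np.exp(logits - logits.max())
        new_pis.append(new_pi / new_pi.sum())
    pis = new_pis

print(np.round(pis[0], 3))  # player 1's approximate equilibrium policy
```

The geometric mixing weights (1 - eta*tau) and eta*tau are how the KL anchor shows up in multiplicative-weights updates; the paper itself operates on LLM policies rather than probability vectors, so treat this purely as intuition for the game dynamics.
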
Key Result

On instruction-following benchmarks, MNPO consistently outperforms two-player NLHF baselines (INPO, ONPO, EGPO), and it maintains superior alignment quality under heterogeneous annotator conditions and mixed-policy evaluation.

Source