ICLR 2026 Orals

Multiplayer Nash Preference Optimization

Fang Wu, Xu Huang, Weihao Xuan, Zhiwei Zhang, Yijia Xiao, Guancheng Wan, Xiaomin Li, Bing Hu, Peng Xia, Jure Leskovec, Yejin Choi

LLMs & Reasoning · Sat, Apr 25 · 10:54–11:04 AM · Room 202 A/B · Avg rating: 6.00 (range 4–8)

Abstract

Reinforcement learning from human feedback (RLHF) has emerged as the standard paradigm for aligning large language models with human preferences. However, reward-based methods grounded in the Bradley–Terry assumption struggle to capture the nontransitivity and heterogeneity of real-world preferences. To address this, recent studies have reframed alignment as a two-player Nash game, giving rise to Nash learning from human feedback (NLHF). While this perspective has inspired algorithms such as INPO, ONPO, and EGPO that offer strong theoretical and empirical guarantees, they remain fundamentally restricted to two-player interactions, introducing a single-opponent bias that fails to capture the full complexity of realistic preference structures. This work introduces Multiplayer Nash Preference Optimization (MNPO), a novel framework that generalizes NLHF to the multiplayer regime. It formulates alignment as an n-player game, where each policy competes against a population of opponents while being regularized toward a reference model. We demonstrate that MNPO inherits the equilibrium guarantees of two-player methods while enabling richer competitive dynamics and improved coverage of diverse preference structures. Comprehensive empirical evaluation shows that MNPO consistently outperforms existing NLHF baselines on instruction-following benchmarks, achieving superior alignment quality under heterogeneous annotator conditions and mixed-policy evaluation scenarios. Together, these results establish MNPO as a principled and scalable framework for aligning LLMs with complex, non-transitive human preferences. Code is available at https://github.com/smiles724/MNPO.
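To make the n-player formulation concrete, one plausible reading of the abstract (the notation below is illustrative and not taken from the paper) is that each policy maximizes its average preference win-rate against the opponent population under a preference oracle, with a KL penalty toward the reference model:

```latex
% Illustrative sketch of an n-player NLHF-style objective (assumed notation):
% \pi_i is the learner, \{\pi_j\}_{j \neq i} the opponent population,
% \mathcal{P} the preference oracle, \pi_{\mathrm{ref}} the reference model,
% \rho the prompt distribution, and \tau the regularization strength.
\max_{\pi_i} \;
  \frac{1}{n-1} \sum_{j \neq i}
  \mathbb{E}_{x \sim \rho,\; y \sim \pi_i(\cdot \mid x),\; y' \sim \pi_j(\cdot \mid x)}
  \big[ \mathcal{P}(y \succ y' \mid x) \big]
  \;-\; \tau \, \mathrm{KL}\!\left( \pi_i \,\|\, \pi_{\mathrm{ref}} \right)
```

Setting n = 2 recovers the two-player NLHF objective used by methods such as INPO, which is consistent with the paper's claim that MNPO inherits two-player equilibrium guarantees.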

One-sentence summary · Auto-generated by claude-haiku-4-5-20251001

MNPO extends Nash learning from human feedback to the multiplayer regime, aligning LLMs with heterogeneous human preferences via an n-player game formulation.

Contributions
  • Generalization of Nash learning from human feedback to multiplayer settings with population dynamics
  • Equilibrium guarantees inherited from two-player methods while enabling richer competitive dynamics
  • Improved alignment quality under heterogeneous annotator conditions and mixed-policy evaluation
Methods used
  • game theory
  • Nash equilibrium
  • preference optimization
  • policy competition
Datasets used
  • instruction-following benchmarks
  • preference-alignment benchmarks
  • reasoning benchmarks
Limitations (author-stated)
  • Performance fundamentally linked to preference data quality
  • Theoretical analysis limited to the homogeneous setting with a shared preference oracle
  • HT-MNPO heterogeneous extension lacks formal convergence guarantees
Future work (author-stated)
  • Explore more nuanced feedback mechanisms for learning in the high-performance regime
  • Investigate alternative equilibrium concepts, such as coarse correlated equilibria, for heterogeneous settings

Author keywords

  • Preference Optimization
  • RLHF
  • LLM Post-training
