ICLR 2026 Orals

Multiplayer Nash Preference Optimization

Fang Wu, Xu Huang, Weihao Xuan, Zhiwei Zhang, Yijia Xiao, Guancheng Wan, Xiaomin Li, Bing Hu, Peng Xia, Jure Leskovec, Yejin Choi

LLMs & Reasoning · Sat, Apr 25 · 10:54–11:04 AM · Room 202 A/B · Avg rating: 6.00 (range 4–8)

Abstract

Reinforcement learning from human feedback (RLHF) has emerged as the standard paradigm for aligning large language models with human preferences. However, reward-based methods grounded in the Bradley–Terry assumption struggle to capture the nontransitivity and heterogeneity of real-world preferences. To address this, recent studies have reframed alignment as a two-player Nash game, giving rise to Nash learning from human feedback (NLHF). While this perspective has inspired algorithms such as INPO, ONPO, and EGPO that offer strong theoretical and empirical guarantees, they remain fundamentally restricted to two-player interactions, introducing a single-opponent bias that fails to capture the full complexity of realistic preference structures. This work introduces Multiplayer Nash Preference Optimization (MNPO), a novel framework that generalizes NLHF to the multiplayer regime. It formulates alignment as an n-player game, where each policy competes against a population of opponents while being regularized toward a reference model. We demonstrate that MNPO inherits the equilibrium guarantees of two-player methods while enabling richer competitive dynamics and improved coverage of diverse preference structures. Comprehensive empirical evaluation shows that MNPO consistently outperforms existing NLHF baselines on instruction-following benchmarks, achieving superior alignment quality under heterogeneous annotator conditions and mixed-policy evaluation scenarios. Together, these results establish MNPO as a principled and scalable framework for aligning LLMs with complex, non-transitive human preferences. Code is available at https://github.com/smiles724/MNPO.
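To make the n-player formulation concrete, one plausible reading of the abstract (the notation below is illustrative and not taken from the paper) is that each policy maximizes its average preference win-rate against the opponent population under a preference oracle, with a KL penalty toward the reference model:

```latex
% Illustrative sketch of an n-player NLHF-style objective (assumed notation):
% \pi_i is the learner, \{\pi_j\}_{j \neq i} the opponent population,
% \mathcal{P} the preference oracle, \pi_{\mathrm{ref}} the reference model,
% \rho the prompt distribution, and \tau the regularization strength.
\max_{\pi_i} \;
  \frac{1}{n-1} \sum_{j \neq i}
  \mathbb{E}_{x \sim \rho,\; y \sim \pi_i(\cdot \mid x),\; y' \sim \pi_j(\cdot \mid x)}
  \big[ \mathcal{P}(y \succ y' \mid x) \big]
  \;-\; \tau \, \mathrm{KL}\!\left( \pi_i \,\|\, \pi_{\mathrm{ref}} \right)
```

Setting n = 2 recovers the two-player NLHF objective used by methods such as INPO, which is consistent with the paper's claim that MNPO inherits two-player equilibrium guarantees.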

One-sentence summary · Auto-generated by claude-haiku-4-5-20251001

MNPO extends Nash learning from human feedback to the multiplayer regime, aligning LLMs with heterogeneous human preferences via an n-player game formulation.

Contributions
  • Generalization of Nash learning from human feedback to multiplayer settings with population dynamics
  • Equilibrium guarantees inherited from two-player methods while enabling richer competitive dynamics
  • Improved alignment quality under heterogeneous annotator conditions and mixed-policy evaluation
Methods used
  • game theory
  • Nash equilibrium
  • preference optimization
  • policy competition
Datasets used
  • instruction-following benchmarks
  • preference-alignment benchmarks
  • reasoning benchmarks
Limitations (author-stated)
  • Performance fundamentally linked to preference data quality
  • Theoretical analysis limited to the homogeneous setting with a shared preference oracle
  • HT-MNPO heterogeneous extension lacks formal convergence guarantees
Future work (author-stated)
  • Explore more nuanced feedback mechanisms for learning in the high-performance regime
  • Investigate alternative equilibrium concepts, such as coarse correlated equilibria, for heterogeneous settings

Author keywords

  • Preference Optimization
  • RLHF
  • LLM Post-training
