ICLR 2026 Orals

Why DPO is a Misspecified Estimator and How to Fix It

Aditya Gopalan, Sayak Ray Chowdhury, Debangshu Banerjee

LLMs & Reasoning Sat, Apr 25 · 11:42 AM–11:52 AM · 202 A/B Avg rating: 6.67 (6–8)
Author-provided TL;DR

DPO is not sound by design and can fail due to misspecification; we fix it with careful analysis.

Abstract

Direct alignment algorithms such as Direct Preference Optimization (DPO) fine-tune models based on preference data, using only supervised learning instead of two-stage reinforcement learning with human feedback (RLHF). We show that DPO encodes a statistical estimation problem over reward functions induced by a parametric policy class. When the true reward function that generates preferences cannot be realized via the policy class, DPO becomes misspecified, resulting in failure modes such as preference order reversal, worsening of policy reward, and high sensitivity to the input preference data distribution. On the other hand, we study the local behavior of two-stage RLHF for a parametric class and relate it to a natural gradient step in policy space. Our fine-grained geometric characterization allows us to propose AuxDPO, which introduces additional auxiliary variables in the DPO loss function to help move towards the RLHF solution in a principled manner and mitigate the misspecification in DPO. We empirically demonstrate the superior performance of AuxDPO on didactic bandit settings as well as LLM alignment tasks.
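The abstract builds on the standard DPO objective, which scores each preference pair by a β-scaled log-probability margin against a reference policy. A minimal per-pair sketch of that standard loss (not the paper's AuxDPO variant, whose auxiliary-variable formulation is not given on this page):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one preference pair (y_w preferred over y_l).

    logp_w, logp_l       : policy log-probs of the chosen / rejected responses
    ref_logp_w, ref_logp_l : reference-policy log-probs of the same responses
    beta                 : inverse temperature scaling the implicit reward
    """
    # Implicit rewards are beta-scaled log-ratios against the reference policy.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Negative log-sigmoid of the margin, i.e. the Bradley-Terry
    # negative log-likelihood: -log(sigmoid(margin)) = log(1 + exp(-margin)).
    return math.log1p(math.exp(-margin))
```

When the policy equals the reference, the margin is zero and the loss is log 2; the misspecification the paper studies arises because the implicit reward above is constrained to those realizable by the policy class.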

One-sentence summary·Auto-generated by claude-haiku-4-5-20251001

AuxDPO introduces auxiliary variables into the DPO loss to mitigate misspecification and move toward the RLHF solution.

Contributions·Auto-generated by claude-haiku-4-5-20251001
  • Characterization of DPO as a statistical estimation problem over reward functions
  • Identification of failure modes when the true reward function cannot be realized by the policy class
  • Geometric characterization relating two-stage RLHF to a natural gradient step in policy space
  • AuxDPO, which adds auxiliary variables and achieves superior performance over DPO variants
Methods used·Auto-generated by claude-haiku-4-5-20251001
  • Direct preference optimization
  • RLHF
  • Auxiliary variable learning
  • Reward function estimation
Datasets used·Auto-generated by claude-haiku-4-5-20251001
  • MMLU-Pro
  • RewardBench v2
  • UltraFeedback
Limitations (author-stated)·Auto-generated by claude-haiku-4-5-20251001

Authors did not state explicit limitations.

Future work (author-stated)·Auto-generated by claude-haiku-4-5-20251001

Authors did not state explicit future directions.

Author keywords

  • Direct Preference Optimization
  • Reinforcement Learning
  • Reinforcement learning with human feedback
