Semi-Supervised Preference Optimization with Limited Feedback

Seonggyun Lee, Sungjun Lim, Seojin Park, Soeun Cheon, Kyungwoo Song

LLMs & Reasoning · Sat, Apr 25 · 10:30 AM–10:40 AM · 202 A/B · Avg rating: 6.00 (2–8)

Abstract

Preference optimization has contributed substantially to aligning language models with human preferences. Recent methods, however, still rely heavily on large amounts of paired (labeled) feedback data, which is costly to collect. To address this, we study Semi-Supervised Preference Optimization (SSPO), which learns simultaneously from a small number of pairwise preference labels and a large pool of unpaired samples. Our key theoretical contribution proves the existence of an optimal reward threshold capable of separating winning and losing responses with high probability, which enables principled pseudo-labeling of unpaired data. By leveraging these pseudo-labels, SSPO distills latent preferences from large-scale unpaired data, maintaining human alignment while drastically reducing acquisition costs. Extensive experiments across datasets validate this data efficiency; for instance, SSPO trained with Mistral-7B-Instruct on just 1% of UltraFeedback consistently surpasses strong baselines trained on 10% of UltraFeedback.
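To make the threshold idea concrete, here is a minimal sketch of threshold-based pseudo-labeling. It is an illustration under stated assumptions, not the authors' method: the abstract only says that a separating reward threshold exists, so this sketch swaps in a simple empirical search over the rewards of the small paired set. The function names `best_threshold` and `pseudo_label` are hypothetical, and the rewards are assumed to be scalar scores already computed for each response.

```python
from typing import List, Sequence, Tuple


def best_threshold(
    win_rewards: Sequence[float], lose_rewards: Sequence[float]
) -> Tuple[float, float]:
    """Empirically pick the reward cut that best separates winners from losers.

    Assumption: a plain accuracy-maximizing search stands in for the
    theoretically derived threshold, whose form is not given in the abstract.
    """
    candidates = sorted(set(win_rewards) | set(lose_rewards))
    total = len(win_rewards) + len(lose_rewards)
    best_t, best_acc = candidates[0], -1.0
    for t in candidates:
        # A response counts as "winning" when its reward is at or above t.
        correct = sum(r >= t for r in win_rewards) + sum(r < t for r in lose_rewards)
        acc = correct / total
        if acc > best_acc:  # strict '>' keeps the lowest threshold on ties
            best_t, best_acc = t, acc
    return best_t, best_acc


def pseudo_label(rewards: Sequence[float], threshold: float) -> List[int]:
    """Label unpaired responses: 1 = pseudo-winning, 0 = pseudo-losing."""
    return [1 if r >= threshold else 0 for r in rewards]


# Toy rewards for a small paired set (made-up numbers, for illustration only).
win = [2.0, 1.5, 0.9, 1.1]
lose = [0.2, 1.0, -0.3, 0.5]
threshold, accuracy = best_threshold(win, lose)
labels = pseudo_label([1.3, 0.1, 0.9], threshold)
```

On this toy data the search settles on a threshold of 0.9, which classifies 7 of the 8 paired responses correctly; the unpaired rewards 1.3, 0.1, and 0.9 are then labeled winning, losing, and winning respectively.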

One-sentence summary · Auto-generated by claude-haiku-4-5-20251001

SSPO achieves data efficiency in preference optimization by pseudo-labeling unpaired data using theoretically grounded reward thresholds.

Contributions · Auto-generated by claude-haiku-4-5-20251001

Proves the existence of an optimal reward threshold that separates winning and losing responses with high probability
Enables principled pseudo-labeling of unpaired data using the optimal threshold derived from a small paired dataset
Uses an adaptive scheduler that creates a curriculum-learning dynamic, shifting focus from labeled to unpaired signals
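
The curriculum dynamic in the last contribution can be pictured with a small sketch. The form of the adaptive scheduler is not given on this page, so the code below swaps in a fixed linear ramp purely for illustration; `unpaired_weight`, `combined_loss`, and the `max_weight` parameter are hypothetical names, and the convex mix of the two losses is an assumption rather than the authors' objective.

```python
def unpaired_weight(step: int, total_steps: int, max_weight: float = 1.0) -> float:
    """Weight on the unpaired (pseudo-labeled) loss at a given training step.

    Assumption: a linear ramp from 0 to max_weight replaces the paper's
    adaptive scheduler, which is not specified here.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    progress = min(max(step / total_steps, 0.0), 1.0)  # clamp to [0, 1]
    return max_weight * progress


def combined_loss(
    labeled_loss: float, unpaired_loss: float, step: int, total_steps: int
) -> float:
    """Mix the two losses so training leans on labeled pairs early on
    and shifts toward the unpaired signal later."""
    w = unpaired_weight(step, total_steps)
    return (1.0 - w) * labeled_loss + w * unpaired_loss


# Halfway through training, both signals contribute equally.
midpoint_loss = combined_loss(labeled_loss=2.0, unpaired_loss=4.0, step=50, total_steps=100)
```

An adaptive variant would replace the step-based `progress` with a quantity measured during training, such as how confident the pseudo-labels are; which signal the authors actually use is not stated here.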

Methods used · Auto-generated by claude-haiku-4-5-20251001

Semi-supervised learning
Pseudo-labeling
Curriculum learning
Preference optimization

Datasets used · Auto-generated by claude-haiku-4-5-20251001

UltraFeedback

Limitations (author-stated) · Auto-generated by claude-haiku-4-5-20251001

Authors did not state explicit limitations.

Future work (author-stated) · Auto-generated by claude-haiku-4-5-20251001

Authors did not state explicit future directions.

Author keywords

Preference Optimization
Semi-Supervised Learning

