P-GenRM: Personalized Generative Reward Model with Test-time User-based Scaling

Pinyi Zhang, Ting-En Lin, Yuchuan Wu, Jingyang Chen, Zongqi Wang, Hua Yang, Xu Ze, Fei Huang, Yongbin Li, Kai Zhang

LLMs & Reasoning · Thu, Apr 23 · 3:51 PM–4:01 PM · Amphitheater · Avg rating: 4.67 (4–6)

Author-provided TL;DR

The first personalized generative reward model with test-time user-based scaling for preference alignment

Abstract

Personalized alignment of large language models seeks to adapt responses to individual user preferences, typically via reinforcement learning. A key challenge is obtaining accurate, user-specific reward signals in open-ended scenarios. Existing personalized reward models face two persistent limitations: (1) they oversimplify diverse, scenario-specific preferences into a small, fixed set of evaluation principles, and (2) they struggle to generalize to new users with limited feedback. To address these limitations, we propose **P-GenRM**, the first **P**ersonalized **Gen**erative **R**eward **M**odel with test-time user-based scaling. P-GenRM transforms preference signals into structured evaluation chains that derive adaptive personas and scoring rubrics across various scenarios. It further clusters users into User Prototypes and introduces a dual-granularity scaling mechanism: at the individual level, it adaptively scales and aggregates each user’s scoring scheme; at the prototype level, it incorporates preferences from similar users. This design mitigates noise in inferred preferences and enhances generalization to unseen users through prototype-based transfer. Empirical results show that P-GenRM achieves state-of-the-art results on widely used personalized reward model benchmarks, with an average improvement of approximately 2.31%, and demonstrates strong generalization on an out-of-distribution dataset. Notably, test-time user-based scaling provides an additional boost of approximately 3%, demonstrating stronger personalized alignment with test-time scalability.
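The abstract describes the dual-granularity scaling only at a high level, so the following is a minimal sketch of how an individual-level score and a prototype-level score might be combined at test time. Everything here is an assumption made for illustration: the function names, the use of a plain mean over sampled scores, the similarity-weighted mean over similar users, and the mixing weight `alpha` are hypothetical and are not taken from the paper.

```python
from statistics import mean


def individual_score(sampled_scores):
    """Aggregate several scores sampled from one user's scoring scheme.

    Assumption: a plain mean; the paper's aggregation rule is not given here.
    """
    return mean(sampled_scores)


def prototype_score(neighbor_scores, similarities):
    """Similarity-weighted mean of scores from similar users (hypothetical)."""
    if len(neighbor_scores) != len(similarities):
        raise ValueError("one similarity weight is needed per neighbor score")
    total = sum(similarities)
    if total <= 0:
        raise ValueError("similarity weights must sum to a positive value")
    return sum(s * w for s, w in zip(neighbor_scores, similarities)) / total


def dual_granularity_score(sampled_scores, neighbor_scores, similarities, alpha=0.5):
    """Blend the individual-level and prototype-level scores.

    alpha is a hypothetical mixing weight: 1.0 uses only the user's own
    scores, 0.0 uses only the scores borrowed from similar users.
    """
    own = individual_score(sampled_scores)
    if not neighbor_scores:
        # No similar users available: fall back to the individual level.
        return own
    shared = prototype_score(neighbor_scores, similarities)
    return alpha * own + (1 - alpha) * shared


if __name__ == "__main__":
    # Three sampled scores for this user, two similar users with weights 1 and 3.
    print(dual_granularity_score([7.0, 8.0, 9.0], [6.0, 10.0], [1.0, 3.0]))
```

Averaging several sampled scores is one simple way to reduce noise in a single inferred preference, and the prototype term is what would let a user with little feedback borrow signal from similar users.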

One-sentence summary · Auto-generated by claude-haiku-4-5-20251001

P-GenRM transforms user preferences into adaptive personas and scoring rubrics with test-time scaling for personalized reward modeling.

Contributions · Auto-generated by claude-haiku-4-5-20251001

First personalized generative reward model transforming preference signals into structured evaluation chains
Derives adaptive personas and scoring rubrics across various scenarios from user preferences
Introduces dual-granularity scaling at individual and prototype levels to reduce noise in preferences
Achieves state-of-the-art results on personalized reward model benchmarks, with an average improvement of approximately 2.31%, and generalizes well to an out-of-distribution dataset

Methods used · Auto-generated by claude-haiku-4-5-20251001

Generative reward modeling
User prototypes
Clustering
Test-time scaling
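As a rough illustration of the "User prototypes" and "Clustering" entries above, the sketch below groups user preference vectors with a small k-means loop and assigns an unseen user to the nearest prototype. This is an assumption-laden sketch, not the authors' procedure: the listing does not say which clustering algorithm or user representation P-GenRM uses, and the names `kmeans` and `nearest_prototype`, the two-dimensional toy vectors, and the deterministic initialization are all hypothetical.

```python
import math


def kmeans(points, k, iters=20):
    """Tiny k-means over lists of floats; returns k centroids.

    Assumption: centroids start at the first k points so the result is
    deterministic. Real systems would use a library implementation.
    """
    centroids = [list(p) for p in points[:k]]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each user vector to its closest centroid.
            idx = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[idx].append(p)
        updated = []
        for j, members in enumerate(clusters):
            if members:
                updated.append([sum(c) / len(members) for c in zip(*members)])
            else:
                updated.append(centroids[j])  # keep an empty cluster's centroid
        if updated == centroids:
            break  # assignments have stabilized
        centroids = updated
    return centroids


def nearest_prototype(user_vec, centroids):
    """Index of the prototype closest to a (possibly unseen) user vector."""
    return min(range(len(centroids)), key=lambda j: math.dist(user_vec, centroids[j]))


if __name__ == "__main__":
    # Hypothetical 2-D preference vectors for six known users.
    users = [[0, 0], [10, 10], [0, 1], [10, 11], [1, 0], [11, 10]]
    prototypes = kmeans(users, k=2)
    print(nearest_prototype([9, 9], prototypes))
```

Once an unseen user is mapped to a prototype, preferences from users in the same group can stand in for feedback the new user has not yet given.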

Datasets used · Auto-generated by claude-haiku-4-5-20251001

Personalized reward model benchmarks

Limitations (author-stated) · Auto-generated by claude-haiku-4-5-20251001

Authors did not state explicit limitations.

Future work (author-stated) · Auto-generated by claude-haiku-4-5-20251001

Authors did not state explicit future directions.

Author keywords

personalized alignment
generative reward model
test-time user-based scaling


