The Art of Scaling Reinforcement Learning Compute for LLMs

Fnu Devvrit, Lovish Madaan, Rishabh Tiwari, Rachit Bansal, Sai Surya Duvvuri, Manzil Zaheer, Inderjit S Dhillon, David Brandfonbrener, Rishabh Agarwal

LLMs & Reasoning · Sat, Apr 25 · 11:06 AM–11:16 AM · 202 A/B · Avg rating: 7.50 (6–8)


Author-provided TL;DR

We study compute scaling properties of RL methods on LLMs

Abstract

Reinforcement learning (RL) has become central to training large language models (LLMs), yet the field lacks predictive scaling methodologies comparable to those established for pre-training. Despite rapidly rising compute budgets, there is no principled understanding of how to evaluate algorithmic improvements for scaling RL compute. We present the first large-scale systematic study, amounting to more than 400,000 GPU-hours, that defines a principled framework for analyzing and predicting RL scaling in LLMs. We fit sigmoidal compute-performance curves for RL training and ablate a wide range of common design choices to analyze their effects on asymptotic performance and compute efficiency. We observe: (1) Not all recipes yield similar asymptotic performance, (2) Details such as loss aggregation, normalization, curriculum, and off-policy algorithm primarily modulate compute efficiency without materially shifting the asymptote, and (3) Stable, scalable recipes follow predictable scaling trajectories, enabling extrapolation from smaller-scale runs. Combining these insights, we propose a _best-practice_ recipe, ScaleRL, and demonstrate its effectiveness by successfully scaling and predicting validation performance on a single RL run scaled up to 100,000 GPU-hours. Our work provides both a _scientific framework_ for analyzing scaling in RL and a practical recipe that brings RL training closer to the predictability long achieved in pre-training.
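To make the idea of a sigmoidal compute-performance curve concrete, here is a minimal sketch. The abstract does not state the functional form, so the four-parameter saturating curve below is an assumption, as are all names (`sigmoid_curve`, `fit_sigmoid`, `r0`, `a`, `b`, `c_mid`) and the synthetic data. In this sketch `a` plays the role of asymptotic performance, while `b` and `c_mid` stand in for compute efficiency. The fit is a plain grid search chosen for readability, not the authors' procedure.

```python
def sigmoid_curve(compute, r0, a, b, c_mid):
    """Assumed saturating curve: starts near r0 and approaches asymptote a.

    c_mid is the compute at which half of the gain (a - r0) is reached;
    b controls how sharply performance rises around c_mid.
    """
    return r0 + (a - r0) / (1.0 + (c_mid / compute) ** b)


def fit_sigmoid(points, r0, a_grid, b_grid, c_mid_grid):
    """Brute-force least-squares fit of (a, b, c_mid) over small grids."""
    best = None
    for a in a_grid:
        for b in b_grid:
            for c_mid in c_mid_grid:
                err = sum(
                    (sigmoid_curve(c, r0, a, b, c_mid) - r) ** 2
                    for c, r in points
                )
                if best is None or err < best[0]:
                    best = (err, a, b, c_mid)
    return best[1:]


# Synthetic small-scale measurements (GPU-hours, validation score).
# These numbers are made up for illustration only.
true_params = dict(r0=0.2, a=0.6, b=1.5, c_mid=2000.0)
budgets = [250, 500, 1000, 2000, 4000, 8000]
points = [(c, sigmoid_curve(c, **true_params)) for c in budgets]

# Hypothetical search grids for the three fitted parameters.
a_grid = [0.50 + 0.01 * i for i in range(21)]    # 0.50 .. 0.70
b_grid = [0.5 + 0.25 * i for i in range(11)]     # 0.5 .. 3.0
c_mid_grid = [500.0 * i for i in range(1, 11)]   # 500 .. 5000

a_hat, b_hat, c_mid_hat = fit_sigmoid(
    points, true_params["r0"], a_grid, b_grid, c_mid_grid
)

# Extrapolate from the small-budget fit to a much larger budget.
predicted = sigmoid_curve(100_000, true_params["r0"], a_hat, b_hat, c_mid_hat)
print(f"asymptote={a_hat:.2f} exponent={b_hat:.2f} midpoint={c_mid_hat:.0f}")
print(f"predicted score at 100,000 GPU-hours: {predicted:.3f}")
```

Under this assumed form, two recipes can share the same `a` while differing in `b` or `c_mid`, which is one way to read the abstract's distinction between shifting the asymptote and modulating compute efficiency.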

One-sentence summary · Auto-generated by claude-haiku-4-5-20251001

A study spanning more than 400,000 GPU-hours provides a principled framework for predicting RL compute scaling in LLMs and yields the ScaleRL recipe.

Contributions · Auto-generated by claude-haiku-4-5-20251001

First large-scale systematic study of RL scaling for LLMs, fitting sigmoidal compute-performance curves
Not all recipes yield similar asymptotic performance
Loss aggregation, normalization, curriculum, and off-policy algorithm primarily modulate compute efficiency, not the asymptote
ScaleRL recipe follows a predictable scaling trajectory, validated on a single run scaled up to 100,000 GPU-hours

Methods used · Auto-generated by claude-haiku-4-5-20251001

Scaling laws
Reinforcement learning
Compute performance analysis

Limitations (author-stated) · Auto-generated by claude-haiku-4-5-20251001

Authors did not state explicit limitations.

Future work (author-stated) · Auto-generated by claude-haiku-4-5-20251001

Derive predictive scaling laws across pre-training compute, model size, and RL data
Study scaling with structured and dense rewards
Apply framework to multi-turn RL, agentic interaction, and long-form reasoning

Author keywords

Scaling
LLMs
Reasoning


Related orals

Benchmarking Empirical Privacy Protection for Adaptations of Large Language Models

Half-order Fine-Tuning for Diffusion Model: A Recursive Likelihood Ratio Optimizer

Invisible Safety Threat: Malicious Finetuning for LLM via Steganography

Reducing Belief Deviation in Reinforcement Learning for Active Reasoning of LLM Agents

RefineStat: Efficient Exploration for Probabilistic Program Synthesis