ICLR 2026 Orals

The Art of Scaling Reinforcement Learning Compute for LLMs

Fnu Devvrit, Lovish Madaan, Rishabh Tiwari, Rachit Bansal, Sai Surya Duvvuri, Manzil Zaheer, Inderjit S Dhillon, David Brandfonbrener, Rishabh Agarwal

LLMs & Reasoning · Sat, Apr 25 · 11:06 AM–11:16 AM · 202 A/B · Avg rating: 7.50 (6–8)
Author-provided TL;DR

We study compute scaling properties of RL methods on LLMs

Abstract

Reinforcement learning (RL) has become central to training large language models (LLMs), yet the field lacks predictive scaling methodologies comparable to those established for pre-training. Despite rapidly rising compute budgets, there is no principled understanding of how to evaluate algorithmic improvements for scaling RL compute. We present the first large-scale systematic study, amounting to more than 400,000 GPU-hours, that defines a principled framework for analyzing and predicting RL scaling in LLMs. We fit sigmoidal compute-performance curves for RL training and ablate a wide range of common design choices to analyze their effects on asymptotic performance and compute efficiency. We observe: (1) not all recipes yield similar asymptotic performance; (2) details such as loss aggregation, normalization, curriculum, and the off-policy algorithm primarily modulate compute efficiency without materially shifting the asymptote; and (3) stable, scalable recipes follow predictable scaling trajectories, enabling extrapolation from smaller-scale runs. Combining these insights, we propose a _best-practice_ recipe, ScaleRL, and demonstrate its effectiveness by successfully scaling and predicting validation performance on a single RL run scaled up to 100,000 GPU-hours. Our work provides both a _scientific framework_ for analyzing scaling in RL and a practical recipe that brings RL training closer to the predictability long achieved in pre-training.
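To illustrate the kind of curve fitting the abstract describes, here is a minimal sketch of fitting a sigmoidal compute-performance curve (sigmoidal in log-compute) with SciPy. The functional form, parameter names (`A` for asymptote, `B` for efficiency exponent, `C_mid` for midpoint compute), and the synthetic data are illustrative assumptions, not the paper's actual parameterization or results:

```python
import numpy as np
from scipy.optimize import curve_fit

def sigmoid_scaling(C, A, B, C_mid):
    """Saturating compute-performance curve (hypothetical parameterization):
    A is the asymptotic performance, B controls compute efficiency,
    and C_mid is the compute at which half of A is reached."""
    return A / (1.0 + (C_mid / C) ** B)

# Synthetic "pass rate vs. GPU-hours" data, for illustration only.
rng = np.random.default_rng(0)
C = np.logspace(2, 5, 20)  # 100 to 100,000 GPU-hours
y = sigmoid_scaling(C, A=0.62, B=1.1, C_mid=3000.0)
y = y + rng.normal(0.0, 0.005, size=C.shape)

# Fit the curve; bounds keep the optimizer in a physically sensible region.
params, _ = curve_fit(
    sigmoid_scaling, C, y,
    p0=[0.5, 1.0, 1000.0],
    bounds=([0.0, 0.0, 1.0], [1.0, 5.0, 1e6]),
)
A_hat, B_hat, Cmid_hat = params
print(f"asymptote A≈{A_hat:.2f}, efficiency B≈{B_hat:.2f}, midpoint≈{Cmid_hat:.0f} GPU-hours")
```

Once such a curve is fit on smaller-scale runs, the estimated asymptote `A` and midpoint `C_mid` are what would let one extrapolate performance at larger compute budgets, as point (3) of the abstract suggests.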

One-sentence summary·Auto-generated by claude-haiku-4-5-20251001

ScaleRL provides a principled framework for predicting RL compute scaling in LLMs through a 400,000 GPU-hour study.

Contributions·Auto-generated by claude-haiku-4-5-20251001
  • First large-scale systematic study on RL scaling with sigmoidal compute-performance curves
  • Not all recipes yield similar asymptotic performance despite similar compute efficiency
  • Loss aggregation, normalization, and curriculum primarily modulate efficiency not asymptote
  • ScaleRL recipe enables stable, scalable training with predictable scaling trajectories
Methods used·Auto-generated by claude-haiku-4-5-20251001
  • Scaling laws
  • Reinforcement learning
  • Compute performance analysis
Limitations (author-stated)·Auto-generated by claude-haiku-4-5-20251001

Authors did not state explicit limitations.

Future work (author-stated)·Auto-generated by claude-haiku-4-5-20251001
  • Derive predictive scaling laws across pre-training compute, model size, and RL data
  • Study scaling with structured and dense rewards
  • Apply framework to multi-turn RL, agentic interaction, and long-form reasoning

Author keywords

  • Scaling
  • LLMs
  • Reasoning
