ICLR 2026 Orals

TROLL: Trust Regions Improve Reinforcement Learning for Large Language Models

Philipp Becker, Niklas Freymuth, Serge Thilges, Fabian Otto, Gerhard Neumann

Reinforcement Learning & Agents · Sat, Apr 25 · 10:42–10:52 AM · Room 202 A/B · Avg rating: 6.50 (range 4–10)
Author-provided TL;DR

Replacing PPO's clipping objective with more principled trust regions improves RL from verifiable rewards.

Abstract

Reinforcement Learning (RL) with PPO-like clip objectives has become the standard choice for reward-based fine-tuning of large language models (LLMs). Although recent work has explored improved estimators of advantages and normalization, the clipping mechanism itself has remained untouched. Originally introduced as a proxy for principled KL-based trust regions, clipping is a crude approximation that often causes unstable updates and suboptimal performance. We replace the clip objective with a novel discrete differentiable trust region projection, which provides principled token-level KL constraints. The projection operates on a sparse subset of the model’s most important token logits to balance computational cost and projection effectiveness. Our approach, Trust Region Optimization for Large Language Models (TROLL), serves as a direct replacement for PPO-like clipping during training and does not alter the model’s inference behavior. Across mathematical reasoning and code generation tasks, model families, as well as advantage-estimation methods, TROLL consistently outperforms PPO-like clipping in terms of training speed, stability, and final success rates.
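
For intuition, here is a minimal PyTorch sketch contrasting the standard PPO-clip surrogate with a KL-penalized alternative. Note that TROLL's actual mechanism is a discrete differentiable projection onto a token-level KL trust region rather than a penalty term; this sketch, in which all function and argument names are illustrative, only conveys the clip-versus-KL distinction described in the abstract.

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """Standard PPO-clip surrogate on per-token log-probabilities."""
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()

def kl_penalized_loss(logits_new, logits_old, logp_new, advantages, beta=0.05):
    """Illustrative KL-regularized alternative: a plain policy-gradient term
    plus a token-level KL(new || old) penalty over the full vocabulary.
    TROLL instead projects the new distribution onto a KL ball around the
    old policy; this penalty form only sketches the idea of a KL constraint."""
    pg = -(logp_new * advantages).mean()
    logq_new = torch.log_softmax(logits_new, dim=-1)
    logq_old = torch.log_softmax(logits_old, dim=-1)
    kl = torch.sum(logq_new.exp() * (logq_new - logq_old), dim=-1).mean()
    return pg + beta * kl
```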

One-sentence summary · Auto-generated by claude-haiku-4-5-20251001

TROLL replaces the PPO clip objective with a differentiable trust-region projection, yielding more stable and efficient reward-based fine-tuning of LLMs.

Contributions · Auto-generated by claude-haiku-4-5-20251001
  • Introduces TROLL, a trust-region-based policy-gradient objective that replaces the PPO-clip mechanism
  • Proposes a novel discrete differentiable trust-region projection that provides token-level KL constraints
  • Extends the projection to sparse distributions, focusing on the most important token logits for computational efficiency (a sketch of one plausible sparsification follows this list)
  • Consistently outperforms PPO-clip across model families, tasks, and advantage-estimation methods
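
The "sparse subset of important token logits" in the third bullet admits a simple reading: restrict the token-level KL computation to the old policy's top-k tokens and fold the remaining probability mass into a single tail bucket. The hypothetical sketch below implements that coarsened KL; the paper's projection uses its own sparsification rule, and k, all names here, and the top-k-plus-tail scheme are assumptions.

```python
import torch

def sparse_topk_kl(logits_new, logits_old, k=64):
    """KL(new || old) computed over a coarsened event space: the old
    policy's top-k tokens, plus one aggregated 'tail' bucket holding all
    remaining probability mass. Hypothetical sketch; k is illustrative."""
    p_old = torch.softmax(logits_old, dim=-1)
    p_new = torch.softmax(logits_new, dim=-1)
    top_old, top_idx = p_old.topk(k, dim=-1)          # old policy's top-k mass
    top_new = p_new.gather(-1, top_idx)               # new policy's mass there
    tail_old = (1 - top_old.sum(-1)).clamp_min(1e-8)  # leftover old mass
    tail_new = (1 - top_new.sum(-1)).clamp_min(1e-8)  # leftover new mass
    kl_top = (top_new * (top_new.clamp_min(1e-8).log()
                         - top_old.clamp_min(1e-8).log())).sum(-1)
    kl_tail = tail_new * (tail_new.log() - tail_old.log())
    return kl_top + kl_tail  # per-token coarsened KL
```
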
Methods used · Auto-generated by claude-haiku-4-5-20251001
  • Policy gradient methods
  • Trust region optimization
  • KL constraints
Datasets used · Auto-generated by claude-haiku-4-5-20251001
  • Mathematical reasoning benchmarks
  • Code generation benchmarks
Limitations (author-stated) · Auto-generated by claude-haiku-4-5-20251001
  • Evaluation is currently limited to dense models of up to 14B parameters (from the paper)
Future work (author-stated) · Auto-generated by claude-haiku-4-5-20251001
  • Scale TROLL to larger models and Mixture-of-Experts architectures (from the paper)
  • Extend TROLL to other modalities, such as vision-language models (from the paper)

Author keywords

  • RL from verifiable rewards
  • Finetuning LLMs
  • Trust Regions
