Themes identified by Claude Sonnet 4.6 across the author-stated future-work and limitations sections of every oral paper. Each theme cites the papers that contribute to it.
Across papers on reasoning, code generation, and instruction following, authors consistently identify RL-based fine-tuning—GRPO, PPO variants, trust-region methods—as the key next step beyond supervised learning. Multiple papers note current experiments are limited to small models (≤14B) or narrow task distributions and call for scaling RL to larger architectures, multi-turn settings, and agentic interaction. The convergence on RL as the dominant lever for eliciting complex behaviors marks a field-wide shift away from pure SFT pipelines.
“Apply framework to multi-turn RL, agentic interaction, and long-form reasoning”
“Scale TROLL to larger models and Mixture-of-Experts architectures”
“Scaling method to larger architectures with adequate compute”
ScaleRL provides a principled framework for predicting RL compute scaling in LLMs through a 400,000 GPU-hour study.
Develops theory linking pre-training coverage to post-training success through model scaling and practical algorithms.
SparseRL leverages deep RL and pretrained models to generate high-performance CUDA code for sparse matrix operations.
TROLL replaces PPO clip objective with differentiable trust region projection for more stable and efficient LLM reward fine-tuning.
DECS framework reduces reasoning model overthinking by decoupling necessary from redundant tokens via curriculum scheduling.
Proposes token-importance guided DPO with gradient attribution weighting and triplet loss for fine-grained LLM alignment.
Triple-BERT addresses order dispatching via centralized SARL with action decomposition and BERT-based attention.
MedAgentGym provides scalable sandbox environment with 72K biomedical tasks for training code-centric LLM agents with RL.
OpenApps testbed reveals UI agent reliability varies drastically across app variations despite stable within-environment performance.
ScaleCUA scales open-source computer use agents with cross-platform dataset and dual-loop data pipeline.
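Several of the papers in this theme revolve around the PPO family of objectives that TROLL's trust-region projection replaces. As background, here is a minimal numpy sketch of the standard clipped surrogate (illustrative only, not any cited paper's implementation):

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """PPO's clipped surrogate: clip the probability ratio to [1-eps, 1+eps]
    and take the pessimistic (elementwise min) value, so updates that move
    the policy too far receive no additional reward."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.minimum(unclipped, clipped)

# A ratio of 1.5 on a positive advantage is capped at 1.2 * advantage:
print(ppo_clip_objective(np.array([1.5]), np.array([2.0])))  # [2.4]
```

With eps = 0.2, a probability ratio of 1.5 earns no more than a ratio of 1.2 would; trust-region methods like TROLL enforce the constraint by projection rather than by this hard clip.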
A recurring limitation across language, vision, and audio models is the mismatch between training-time sequence lengths and deployment demands. Authors frequently cite the need for improved positional encodings, KV-cache compression, and memory architectures to handle longer inputs without retraining. Several papers also note the theoretical gap between results proved for short sequences and real-world long-context behaviors, and call for both architectural and theoretical work to bridge it.
“Develop more advanced memory architectures and training strategies for enhancing long-context capabilities of LLMs”
“Characterize how different positional embedding schemes affect minimum training length”
“Call on LLM builders to prioritize multi-turn reliability, as known remediations for simpler settings prove ineffective”
MemAgent uses RL-trained memory modules to enable LLMs to extrapolate from 8K to 3.5M token contexts with minimal performance degradation.
Compresses KV cache in reasoning models via thought-adaptive quantization and eviction achieving near-lossless accuracy.
MrRoPE generalizes RoPE-extension via radix system conversion, achieving train-short-test-long with doubled effective context window.
Quantitative bounds show training length required for length generalization depends on periodicity, locality, alphabet size, and model norms.
Study showing LLMs exhibit 39% average performance drop in multi-turn conversations, failing to recover from wrong contextual assumptions.
Shows tool-use enables state space models to achieve length generalization previously limited by fixed-size memory.
Q-RAG fine-tunes embedders for multi-step retrieval using reinforcement learning, achieving state-of-the-art on long-context QA.
Theory of context length scaling through Intrinsic Entropy explaining optimal context length and training dataset size relationship.
Studies temporal superposition in RNNs, showing how memory demands affect representational geometry and how RNNs learn different encoding strategies.
SwingArena evaluates LLMs on GitHub issue solving via adversarial framework modeling submitter-reviewer collaboration with retrieval-augmented code generation.
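Much of the positional-encoding work in this theme builds on rotary embeddings (RoPE). The sketch below shows the basic rotation plus position interpolation, one common train-short-test-long trick; it is illustrative background only, not MrRoPE's radix-conversion method:

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Apply rotary position embedding to a vector of even dimension d:
    each pair (x[2i], x[2i+1]) is rotated by angle pos * base**(-2i/d)."""
    d = x.shape[-1]
    inv_freq = base ** (-np.arange(0, d, 2) / d)
    theta = pos * inv_freq
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def rope_interpolated(x, pos, train_len, test_len, base=10000.0):
    """Position interpolation: squeeze unseen positions back into the
    trained range by rescaling pos, instead of extrapolating."""
    return rope_rotate(x, pos * train_len / test_len, base)
```

Two properties make RoPE attractive: rotations preserve vector norms, and the inner product of two rotated vectors depends only on their relative position, which is exactly what attention scores need.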
Papers working on language, image, or audio systems uniformly name video understanding and generation as an outstanding next step, and many additionally flag missing modalities such as 3D, tabular, radar, or multilingual data. The pattern is consistent: benchmarks, reward models, and generation pipelines built for text or static images are noted as needing systematic generalization to richer temporal and sensory modalities. This reflects a convergence on omni-modal systems as the target architecture.
“Incorporate video understanding and generation tasks into evaluation system”
“Extend ADP beyond text to images, screen recordings, and multimodal data”
“Build unified audio representation for more scalable joint training”
UFEval provides unified fine-grained evaluation of multimodal LLM outputs with aspect and task generalization.
Omni-Reward addresses modality imbalance and preference rigidity with omni-modal reward modeling framework.
RALI framework aligns images to text representations from reasoning MLLMs using contrastive learning, achieving comparable image quality assessment performance with <5% parameters.
InfoTok achieves adaptive video tokenization using information-theoretic compression and ELBO-based routing.
Aggregates speech tokens into latent patches for efficient speech-text modeling with cross-modal alignment.
UALM, a unified audio language model, handles understanding, text-to-audio generation, and multimodal reasoning in a single model, with UALM-Reason for cross-modal generative reasoning.
Revela enables self-supervised retriever learning by adapting language modeling objectives, achieving unsupervised SoTA on multiple retrieval benchmarks.
ADP lightweight protocol unifies 13 heterogeneous agent datasets into single training schema achieving 20% average performance gain over base models.
Systematic study reveals LLMs acquire visual perception priors from diverse data and reasoning priors from code/math corpora.
MedAgentGym provides scalable sandbox environment with 72K biomedical tasks for training code-centric LLM agents with RL.
mCLM uses modular chemical language combining natural language and molecular building blocks for function-aware synthesis.
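The reward-modeling thread above (e.g. Omni-Reward) ultimately rests on a pairwise preference objective, typically Bradley-Terry, whatever the modality of the inputs. A minimal sketch of that loss, illustrative and not taken from any cited paper:

```python
import numpy as np

def preference_loss(r_chosen, r_rejected):
    """Bradley-Terry pairwise loss used to fit reward models from
    preference data: -log sigmoid(r_chosen - r_rejected), written
    via log1p for numerical stability."""
    margin = np.asarray(r_chosen) - np.asarray(r_rejected)
    return float(np.mean(np.log1p(np.exp(-margin))))

# A larger reward margin for the chosen response yields a lower loss:
print(preference_loss(2.0, 0.0) < preference_loss(0.5, 0.0))  # True
```

Extending such a reward model to video or audio changes the encoder producing the scores, not this objective, which is part of why modality imbalance in the preference data matters so much.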
Multiple papers building evaluation suites or agent training environments note that current benchmarks are limited to short, simple tasks and that simulation fidelity, coverage of realistic adversarial conditions, and scalable automated evaluation remain open problems. Authors call for environments that support RL fine-tuning loops, multi-step interaction, and adversarial injection, pointing toward a community-wide infrastructure effort to support the next generation of agent research.
“Use OpenApps for scaling agent training pipelines via RL fine-tuning”
“Developing fresh benchmark problems using latest scientific knowledge, contamination-resistant and past training cutoff dates”
“Enable agents to learn through trial and error inside simulator rather than imitating recorded examples”
CyberGym benchmarks AI agents on 1,507 real-world vulnerabilities discovering 34 zero-days, showing top models achieve only 22% success on PoC generation.
Presents AstaBench, comprehensive benchmark suite with production-grade tools for rigorous evaluation of AI agents on scientific research tasks.
OpenApps testbed reveals UI agent reliability varies drastically across app variations despite stable within-environment performance.
SimuHome introduces Matter protocol-grounded smart home simulator and 600-episode benchmark evaluating LLM agents on device control and workflow scheduling.
Introduces RedTeamCUA framework with hybrid web-OS sandbox for adversarial testing of computer-use agents.
ScaleCUA scales open-source computer use agents with cross-platform dataset and dual-loop data pipeline.
MedAgentGym provides scalable sandbox environment with 72K biomedical tasks for training code-centric LLM agents with RL.
Introduces EditBench benchmark for real-world LLM code editing with 545 problems from actual developer usage.
BIRD-INTERACT benchmark evaluates LLMs on dynamic multi-turn text-to-SQL tasks with function-driven user simulator and dual interaction settings.
CounselBench large-scale benchmark with 2000 expert evaluations and 120 adversarial questions for evaluating LLMs in mental health question answering.
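The environments in this theme share a reset/step interaction loop that RL fine-tuning plugs into. Below is a toy sketch of that interface with a scripted policy standing in for the LLM agent; all names are illustrative, not any benchmark's actual API:

```python
import random

class ToyTaskEnv:
    """Minimal sandbox-style environment: the agent must find a hidden
    integer; observations are 'higher'/'lower' feedback, mirroring the
    multi-step interaction these benchmarks evaluate."""
    def __init__(self, lo=0, hi=15, seed=0):
        self.lo, self.hi = lo, hi
        self.rng = random.Random(seed)

    def reset(self):
        self.target = self.rng.randint(self.lo, self.hi)
        self.steps = 0
        return "start"

    def step(self, action):
        self.steps += 1
        if action == self.target:
            return "correct", 1.0, True          # obs, reward, done
        obs = "higher" if action < self.target else "lower"
        return obs, 0.0, self.steps >= 10        # budget of 10 steps

def binary_search_agent(env):
    """A scripted policy standing in for an LLM agent."""
    obs, lo, hi = env.reset(), env.lo, env.hi
    done, reward = False, 0.0
    while not done:
        guess = (lo + hi) // 2
        obs, reward, done = env.step(guess)
        if obs == "higher":
            lo = guess + 1
        elif obs == "lower":
            hi = guess - 1
    return reward
```

Trial-and-error learning inside a simulator, as the quoted future-work item asks for, amounts to replacing the scripted policy here with one updated from the (obs, reward, done) stream.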
Papers on backdoor attacks, steganographic triggers, reward hacking, and model fingerprinting share a common structure: a threat is demonstrated at small scale and authors call for mitigation strategies, extension to larger models, and evaluation against adaptive adversaries. The repeated limitation that experiments are confined to models ≤3B–13B reveals a gap between the studied threat surface and the deployed model scale, and points toward scalable detection and defense as a pressing research direction.
“Develop technical mitigations for finetuning-activated attacks”
“Evaluate TRACE on more realistic, heterogeneous loopholes”
“Redesign deception benchmarks using statistical methods for detecting deception rather than assuming correctness of LLM responses”
Demonstrates LLMs can be finetuned to generate harmful steganographically-hidden outputs while appearing benign to safety systems.
FAB enables adversaries to create compromised LLMs that exhibit dormant adversarial behaviors triggered only during downstream finetuning.
Framework detects self-initiated deception in LLMs via statistical metrics showing both deceptive intention and behavior correlate with task difficulty.
Introduces RedTeamCUA framework with hybrid web-OS sandbox for adversarial testing of computer-use agents.
MRT systematically stress-tests LLM agent monitoring, revealing that agent awareness dominates and that hybrid scaffolding enables weak-to-strong monitoring.
Extended logit matrices reveal low-rank structure of language models enabling linear generation from unrelated prompts.
Detects implicit reward hacking by measuring reasoning effort through truncated CoT analysis.
Ellipse signatures function as forgery-resistant model output identifiers based on high-dimensional geometric constraints.
Introduces semantically conditioned watermarks for stealthy LLM fingerprinting that remains robust across deployment scenarios.
LLM DNA low-dimensional functional representation reveals evolutionary relationships among 305 LLMs through phylogenetic analysis.
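As a toy illustration of the covert channel that steganographic attacks exploit, the sketch below hides bits in word-length parity. Everything here is hypothetical; real attacks embed payloads far more subtly inside fluent model outputs:

```python
EVEN_WORDS = ["data", "code", "text", "plan"]     # length % 2 == 0 -> bit 0
ODD_WORDS = ["model", "token", "agent", "input"]  # length % 2 == 1 -> bit 1

def hide_bits(bits):
    """Emit an innocuous-looking word sequence whose word-length parity
    encodes the secret bits."""
    return [(ODD_WORDS if b else EVEN_WORDS)[i % 4] for i, b in enumerate(bits)]

def recover_bits(words):
    """Decode: the receiver needs only the parity rule, not the word lists."""
    return [len(w) % 2 for w in words]

secret = [1, 0, 1, 1, 0]
print(recover_bits(hide_bits(secret)) == secret)  # True
```

A safety filter reading the cover text sees only benign words, which is the core difficulty these papers raise: the harmful payload is invisible to content-level inspection.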
Authors working on sparse autoencoders, causal interventions, computational graphs, and topological analysis of representations consistently note that current tools are insufficiently precise, do not scale to large models, and rely on unvalidated assumptions about linear representations. Future work directions coalesce around more faithful attribution methods, semantic grounding of identified features, and automated verification pipelines, suggesting that interpretability is moving from exploratory to engineering-grade.
“Improve interpretability tools such as more faithful sparse autoencoders and more precise attribution methods”
“Use learned features as state trackers for detecting significant changes in model behavior”
“Extract concepts and features from low-rank representation space for model-agnostic interpretability”
CRV uses attribution graphs as execution traces to verify chain-of-thought reasoning with white-box mechanistic analysis of computation failures.
Extended logit matrices reveal low-rank structure of language models enabling linear generation from unrelated prompts.
Temporal Sparse Autoencoders incorporate contrastive loss encouraging consistent feature activations over adjacent tokens to discover semantic concepts.
Uses sparse autoencoders and foundation models to discover unknown causal effects in scientific trials.
WIMHF uses sparse autoencoders to extract human-interpretable features from preference data, enabling better understanding and curation of human feedback.
Uses persistent homology to characterize topological compression in LLM latent spaces induced by adversarial inputs.
Detects implicit reward hacking by measuring reasoning effort through truncated CoT analysis.
LLM DNA low-dimensional functional representation reveals evolutionary relationships among 305 LLMs through phylogenetic analysis.
Gradient leading-term analysis reveals how semantic associations emerge in transformers as compositions of bigram, interchangeability, and context mapping functions.
Study of causal interventions showing they produce out-of-distribution representations, proposing Counterfactual Latent loss to mitigate harmful divergences.
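Several of these papers build on sparse autoencoders. The sketch below shows the core construction only, an overcomplete ReLU code with an L1 sparsity penalty; real SAEs for LLM activations add bias terms, decoder-norm constraints, and large-scale training, none of which is modeled here:

```python
import numpy as np

class SparseAutoencoder:
    """Minimal SAE forward pass: project activations into an overcomplete
    dictionary, keep a nonnegative (ReLU) code, and penalize its L1 norm
    so that only a few features fire per input."""
    def __init__(self, d_in, d_hidden, seed=0):
        rng = np.random.default_rng(seed)
        self.W_enc = rng.normal(0, 0.1, (d_in, d_hidden))
        self.W_dec = rng.normal(0, 0.1, (d_hidden, d_in))

    def encode(self, x):
        return np.maximum(x @ self.W_enc, 0.0)  # sparse nonnegative code

    def loss(self, x, l1=1e-3):
        z = self.encode(x)
        recon = z @ self.W_dec
        mse = ((recon - x) ** 2).mean()
        sparsity = np.abs(z).mean()
        return mse + l1 * sparsity, z
```

The faithfulness complaints quoted above target exactly this trade-off: the L1 term buys interpretable, sparse features at the cost of reconstruction error, and it silently assumes concepts are linear directions in activation space.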
Papers on latent world models, compositional planning, and robot learning share a common roadblock: models trained on narrow domains (robotics, video games) must be generalized to diverse real-world settings, and current approaches lack reliable dynamics modeling and multi-modal conditioning. Authors point toward physics-guided training, integration of language and reward signals, and evaluation on real robotic hardware as the required next steps, reflecting a convergence on interactive simulation as foundational infrastructure for embodied AI.
“Enable unified multi-modal conditioning with simultaneous action, language and image signals”
“Physics-guided motion generation and physics-aware reinforcement post-training for precise dynamics modeling”
“Benchmark on large-scale real robotic datasets”
LPWM enables self-supervised object-centric world modeling with latent action module for stochastic video generation and control.
Learns zero-shot RL representations via temporal difference latent prediction recovering successor factorization.
Introduces closed-loop benchmark evaluating generative world models on embodied task performance rather than visual quality.
Rodrigues Networks inject kinematics-aware inductive biases for improved action learning in articulated robot tasks.
MVP achieves fastest one-step action generation with instantaneous velocity constraint providing high expressiveness for robotic control.
Develops methods for LMs to ask informative questions and make decisions under uncertainty using Bayesian Experimental Design.
Introduces CDGS integrating compositional diffusion with guided search for coherent long-horizon plan generation.
SimuHome introduces Matter protocol-grounded smart home simulator and 600-episode benchmark evaluating LLM agents on device control and workflow scheduling.
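The "reliable dynamics modeling" this theme calls for can be grounded in a toy example: fit latent dynamics from observed transitions, then roll the learned model forward to imagine trajectories. The linear system below is purely illustrative; no cited paper uses it:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "world": a 2-D latent state evolves linearly under a 1-D action.
# A_true and B_true are hypothetical ground-truth dynamics.
A_true = np.array([[0.9, 0.1], [0.0, 0.95]])
B_true = np.array([[0.0], [0.5]])

states = rng.normal(size=(500, 2))
actions = rng.normal(size=(500, 1))
next_states = states @ A_true.T + actions @ B_true.T

# Fit [A B] jointly by least squares: next = [s a] @ [A B]^T.
X = np.hstack([states, actions])
W, *_ = np.linalg.lstsq(X, next_states, rcond=None)
A_hat, B_hat = W[:2].T, W[2:].T

def rollout(s, acts):
    """Imagine a trajectory by iterating the learned dynamics."""
    traj = [s]
    for a in acts:
        traj.append(A_hat @ traj[-1] + B_hat @ a)
    return np.stack(traj)
```

Real world models replace the least-squares fit with learned nonlinear (often stochastic) latent dynamics, which is where the papers above locate the open problems: multi-modal conditioning, physics guidance, and transfer to real hardware.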