Reinforcement Learning & Agents

Reinforcement learning, decision-making, autonomous agents, multi-agent systems, and planning.

All papers

Min rating

Sort

AgentGym-RL: An Open-Source Framework to Train LLM Agents for Long-Horizon Decision Making via Multi-Turn RL

Presents unified RL framework for training LLM agents on long-horizon decision-making with staged interaction scaling.

Avg rating: 7.00 (6–10) · Zhiheng Xi et al.

AstaBench: Rigorous Benchmarking of AI Agents with a Scientific Research Suite

Presents AstaBench, comprehensive benchmark suite with production-grade tools for rigorous evaluation of AI agents on scientific research tasks.

Avg rating: 7.00 (6–8) · Jonathan Bragg et al.

Differentiable Model Predictive Control on the GPU

DiffMPC provides GPU-accelerated differentiable MPC solver leveraging problem structure for efficient parallelization.

Avg rating: 7.33 (6–8) · Emre Adabag et al.

DiffusionNFT: Online Diffusion Reinforcement with Forward Process

DiffusionNFT enables efficient online reinforcement learning for diffusion models via forward process optimization with up to 25x efficiency gains.

Avg rating: 7.33 (6–8) · Kaiwen Zheng et al.

Discount Model Search for Quality Diversity Optimization in High-Dimensional Measure Spaces

Proposes Discount Model Search for quality diversity optimization in high-dimensional measure spaces.

Avg rating: 5.50 (4–8) · Bryon Tjanaka et al.

Enhancing Generative Auto-bidding with Offline Reward Evaluation and Policy Search

AIGB-Pearl enhances generative auto-bidding with trajectory evaluator and KL-Lipschitz-constrained optimization for safe exploration beyond offline data.

Avg rating: 6.00 (4–8) · Zhiyu Mou et al.

Exploratory Diffusion Model for Unsupervised Reinforcement Learning

Proposes ExDM using diffusion models for exploration and policy learning in unsupervised reinforcement learning.

Avg rating: 6.00 (6–6) · Chengyang Ying et al.

From movement to cognitive maps: recurrent neural networks reveal how locomotor development shapes hippocampal spatial coding

RNN models of hippocampus reveal how locomotor development statistics shape emergence of spatial neural representations.

Avg rating: 6.50 (2–10) · Marco P Abrate et al.

Hyperparameter Trajectory Inference with Conditional Lagrangian Optimal Transport

Hyperparameter Trajectory Inference uses conditional Lagrangian optimal transport to reconstruct neural network outputs across hyperparameter spectra without expensive retraining.

Avg rating: 4.00 (2–6) · Harry Amad et al.

Latent Particle World Models: Self-supervised Object-centric Stochastic Dynamics Modeling

LPWM enables self-supervised object-centric world modeling with latent action module for stochastic video generation and control.

Avg rating: 7.33 (6–8) · Tal Daniel et al.

Mastering Sparse CUDA Generation through Pretrained Models and Deep Reinforcement Learning

SparseRL leverages deep RL and pretrained models to generate high-performance CUDA code for sparse matrix operations.

Avg rating: 6.00 (4–8) · Yaoyu Wang et al.

Mean Flow Policy with Instantaneous Velocity Constraint for One-step Action Generation

MVP achieves fastest one-step action generation with instantaneous velocity constraint providing high expressiveness for robotic control.

Avg rating: 7.00 (4–8) · Guojian Zhan et al.

MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent

MemAgent uses RL-trained memory modules to enable LLMs to extrapolate from 8K to 3.5M token contexts with minimal performance degradation.

Avg rating: 6.50 (4–8) · Hongli Yu et al.

MomaGraph: State-Aware Unified Scene Graphs with Vision-Language Models for Embodied Task Planning

MomaGraph learns unified task-oriented scene representations integrating spatial-functional relationships for embodied agents to perform planning and manipulation.

Avg rating: 6.50 (6–8) · Yuanchen Ju et al.

Non-Asymptotic Analysis of (Sticky) Track-and-Stop

Provides first finite-confidence analysis of Track-and-Stop and Sticky Track-and-Stop algorithms for pure exploration problems.

Avg rating: 6.00 (4–8) · Riccardo Poiani et al.

OpenApps: Simulating Environment Variations to Measure UI Agent Reliability

OpenApps testbed reveals UI agent reliability varies drastically across app variations despite stable within-environment performance.

Avg rating: 6.50 (6–8) · Karen Ullrich et al.

Optimistic Task Inference for Behavior Foundation Models

OpTI-BFM uses optimistic decision criterion modeling uncertainty over reward functions to enable efficient task inference for behavior foundation models.

Avg rating: 6.50 (6–8) · Thomas Rupf et al.

Overthinking Reduction with Decoupled Rewards and Curriculum Data Scheduling

DECS framework reduces reasoning model overthinking by decoupling necessary from redundant tokens via curriculum scheduling.

Avg rating: 6.50 (2–10) · Shuyang Jiang et al.

Q-RAG: Long Context Multi‑Step Retrieval via Value‑Based Embedder Training

Q-RAG fine-tunes embedders for multi-step retrieval using reinforcement learning, achieving state-of-the-art on long-context QA.

Avg rating: 6.00 (2–8) · Artyom Sorokin et al.

Rodrigues Network for Learning Robot Actions

Rodrigues Networks inject kinematics-aware inductive biases for improved action learning in articulated robot tasks.

Avg rating: 6.00 (2–8) · Jialiang Zhang et al.

Task-free Adaptive Meta Black-box Optimization

ABOM performs task-free adaptive meta black-box optimization using online parameter adaptation without predefined task distributions.

Avg rating: 5.50 (2–8) · Chao Wang et al.

TD-JEPA: Latent-predictive Representations for Zero-Shot Reinforcement Learning

Learns zero-shot RL representations via temporal difference latent prediction recovering successor factorization.

Avg rating: 7.50 (6–8) · Marco Bagatella et al.

Triple-BERT: Do We Really Need MARL for Order Dispatch on Ride-Sharing Platforms?

Characterizes online learning with ranking feedback showing sublinear regret impossible in general, possible with variation bounds.

Avg rating: 6.50 (6–8) · Zijian Zhao et al.

TROLL: Trust Regions Improve Reinforcement Learning for Large Language Models

TROLL replaces PPO clip objective with differentiable trust region projection for more stable and efficient LLM reward fine-tuning.

Avg rating: 6.50 (4–10) · Philipp Becker et al.