Triple-BERT: Do We Really Need MARL for Order Dispatch on Ride-Sharing Platforms?
Zijian Zhao, Sen Li
This paper proposes a novel centralized reinforcement learning framework for large-scale order dispatching tasks in ride-sharing scenarios, achieving better cooperation among workers compared to previous multi-agent methods.
Abstract
On-demand ride-sharing platforms, such as Uber and Lyft, face the intricate real-time challenge of bundling and matching passengers—each with distinct origins and destinations—to available vehicles, all while navigating significant system uncertainties. Due to the extensive observation space arising from the large number of drivers and orders, order dispatching, though fundamentally a centralized task, is often addressed using Multi-Agent Reinforcement Learning (MARL). However, independent MARL methods fail to capture global information and exhibit poor cooperation among workers, while Centralized Training Decentralized Execution (CTDE) MARL methods suffer from the curse of dimensionality. To overcome these challenges, we propose Triple-BERT, a centralized Single-Agent Reinforcement Learning (SARL) method designed specifically for large-scale order dispatching on ride-sharing platforms. Built on a variant of TD3, our approach addresses the vast action space through an action decomposition strategy that breaks the joint action probability into individual driver action probabilities. To handle the extensive observation space, we introduce a novel BERT-based network, where parameter reuse mitigates parameter growth as the number of drivers and orders increases, and the attention mechanism effectively captures the complex relationships among the large pool of drivers and orders. We validate our method on a real-world ride-hailing dataset from Manhattan. Triple-BERT achieves approximately an 11.95% improvement over current state-of-the-art methods, with a 4.26% increase in served orders and a 22.25% reduction in pickup times. Our code, trained model parameters, and processed data are publicly available at https://github.com/RS2002/Triple-BERT .
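For intuition, the decomposition described in the abstract factorizes the joint dispatch policy as π(a₁, …, a_N | s) = ∏ᵢ πᵢ(aᵢ | s), so the network only ever needs to output per-driver distributions over candidate orders. Below is a minimal PyTorch sketch of that idea, not the authors' released implementation: a standard transformer encoder stands in for the BERT-based network, the shared driver/order embeddings illustrate the parameter-reuse claim, and all layer sizes, tensor shapes, and names (DispatchPolicy, driver_dim, order_dim) are illustrative assumptions.

```python
# Minimal sketch (not the released Triple-BERT code) of the two ideas the
# abstract describes: (1) drivers and orders are embedded as tokens by shared
# layers, so parameter count is independent of fleet/order size, and
# (2) the joint dispatch action is factorized into per-driver distributions.
import torch
import torch.nn as nn

class DispatchPolicy(nn.Module):
    def __init__(self, driver_dim, order_dim, d_model=128, n_heads=4, n_layers=2):
        super().__init__()
        # Shared embeddings: the same weights serve every driver/order token,
        # so adding drivers or orders does not add parameters.
        self.driver_embed = nn.Linear(driver_dim, d_model)
        self.order_embed = nn.Linear(order_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, drivers, orders):
        # drivers: (B, N, driver_dim), orders: (B, M, order_dim)
        tokens = torch.cat([self.driver_embed(drivers),
                            self.order_embed(orders)], dim=1)
        h = self.encoder(tokens)            # attention mixes every driver/order pair
        h_driver = h[:, :drivers.size(1)]   # (B, N, d_model)
        h_order = h[:, drivers.size(1):]    # (B, M, d_model)
        # Per-driver logits over orders; the joint action probability is the
        # product of these individual driver distributions.
        logits = torch.einsum('bnd,bmd->bnm', h_driver, h_order)
        return torch.softmax(logits, dim=-1)  # (B, N, M), rows sum to 1

# Toy usage: 3 drivers, 5 candidate orders.
policy = DispatchPolicy(driver_dim=6, order_dim=8)
probs = policy(torch.randn(1, 3, 6), torch.randn(1, 5, 8))
print(probs.shape)  # torch.Size([1, 3, 5])
```

Because the embedding and encoder weights are shared across all driver and order tokens, growing the fleet or order pool only lengthens the token sequence, which is the scaling property the abstract attributes to the BERT-based design.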
Author keywords
- Reinforcement Learning
- Order Dispatching
- Ride Sharing
Related orals
Mastering Sparse CUDA Generation through Pretrained Models and Deep Reinforcement Learning
SparseRL leverages deep RL and pretrained models to generate high-performance CUDA code for sparse matrix operations.
Overthinking Reduction with Decoupled Rewards and Curriculum Data Scheduling
DECS framework reduces reasoning model overthinking by decoupling necessary from redundant tokens via curriculum scheduling.
MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent
MemAgent uses RL-trained memory modules to enable LLMs to extrapolate from 8K to 3.5M token contexts with minimal performance degradation.
DiffusionNFT: Online Diffusion Reinforcement with Forward Process
DiffusionNFT enables efficient online reinforcement learning for diffusion models via forward process optimization with up to 25x efficiency gains.
Hyperparameter Trajectory Inference with Conditional Lagrangian Optimal Transport
Hyperparameter Trajectory Inference uses conditional Lagrangian optimal transport to reconstruct neural network outputs across hyperparameter spectra without expensive retraining.