Mastering Sparse CUDA Generation through Pretrained Models and Deep Reinforcement Learning
Yaoyu Wang, Hankun Dai, Zhidong Yang, Junmin Xiao, Guangming Tan
We propose SparseRL, a deep reinforcement learning framework that generates high-performance CUDA code for sparse matrix operations, achieving significant improvements in both correctness and execution efficiency.
Abstract
Code generation is a crucial research area in the field of artificial intelligence, holding the potential to revolutionize software development and streamline programming processes. However, generating high-performance code that must execute quickly in low-latency scenarios remains a formidable challenge. Existing methods often struggle to account for the irregularity of input sparse data in sparse programs and the need for domain-specific architectural knowledge, leading to sub-optimal performance. To tackle these issues, we propose the SparseRL framework. SparseRL leverages deep reinforcement learning, treating a pre-trained language model as a stochastic policy. It takes the row and column indices of the non-zero elements of a sparse matrix as input and generates CUDA code for sparse matrix operations as output. We also introduce a domain-specific code generation mechanism for dynamic inputs, a sinusoidal embedding technique tailored to sparse matrices, and a hierarchical reward function that considers both code correctness and execution efficiency. Experimental results demonstrate that SparseRL achieves state-of-the-art performance. On sparse matrix-vector multiplication (SpMV) tasks, it improves the compilation rate by 20% compared to existing methods, and the generated code runs 30% faster on average. On sparse matrix-dense matrix multiplication (SpMM) tasks, SparseRL also shows significant performance gains. These results highlight the effectiveness of SparseRL in generating high-performance CUDA code for sparse matrix operations.
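To ground the task, the following is a minimal NumPy sketch of the SpMV computation (y = A·x on a CSR-format matrix) that the generated CUDA kernels target. The function name and CSR array layout here are standard conventions, not code from the paper:

```python
import numpy as np

def spmv_csr(indptr, indices, data, x):
    """Reference SpMV y = A @ x for a matrix stored in CSR format.

    indptr  : row pointer array, length n_rows + 1
    indices : column index of each non-zero element
    data    : value of each non-zero element
    x       : dense input vector
    """
    n_rows = len(indptr) - 1
    y = np.zeros(n_rows, dtype=np.result_type(data, x))
    for row in range(n_rows):
        start, end = indptr[row], indptr[row + 1]
        # Dot product of this row's non-zeros with the gathered x entries
        y[row] = np.dot(data[start:end], x[indices[start:end]])
    return y

# 3x3 sparse matrix: [[1, 0, 2], [0, 3, 0], [4, 0, 5]]
indptr  = np.array([0, 2, 3, 5])
indices = np.array([0, 2, 1, 0, 2])
data    = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x       = np.array([1.0, 1.0, 1.0])
print(spmv_csr(indptr, indices, data, x))  # [3. 3. 9.]
```

The irregularity the abstract refers to is visible here: each row touches a different, data-dependent set of `x` entries, which is what makes a single fixed CUDA kernel hard to optimize across matrices.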
SparseRL leverages deep RL and pretrained models to generate high-performance CUDA code for sparse matrix operations.
- Domain-specific code generation mechanism for dynamic sparse matrix inputs
- Sinusoidal embedding technique tailored for sparse matrices
- Hierarchical reward function considering both code correctness and execution efficiency
- 20% improvement in compilation rate and 30% faster execution on SpMV tasks
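The paper does not spell out the sinusoidal embedding here, so the following is a plausible sketch assuming a transformer-style sinusoidal encoding applied independently to the row and column indices of the non-zeros; the dimension, frequency base, and concatenation scheme are illustrative assumptions:

```python
import numpy as np

def sinusoidal_embed(idx, dim):
    """Transformer-style sinusoidal encoding of integer indices.

    idx : 1-D sequence of non-negative integers (row or column indices)
    dim : embedding dimension (must be even)
    """
    idx = np.asarray(idx, dtype=np.float64)[:, None]                # (nnz, 1)
    # Geometric frequency schedule, base 10000 (assumed, as in transformers)
    freqs = np.exp(-np.log(10000.0) * np.arange(0, dim, 2) / dim)   # (dim/2,)
    angles = idx * freqs                                            # (nnz, dim/2)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

def embed_nonzeros(rows, cols, dim=64):
    """One vector per non-zero: row and column encodings concatenated."""
    return np.concatenate(
        [sinusoidal_embed(rows, dim), sinusoidal_embed(cols, dim)], axis=-1
    )

rows = [0, 0, 1, 2, 2]
cols = [0, 2, 1, 0, 2]
emb = embed_nonzeros(rows, cols, dim=64)
print(emb.shape)  # (5, 128)
```

A fixed (non-learned) encoding like this lets the policy condition on sparsity structure of arbitrary size without retraining an index vocabulary.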
- Reinforcement learning
- Pretrained language models
- Domain-specific code generation
- University of Florida Sparse Matrix Collection
- RL-based optimization is computationally expensive during fine-tuning due to compiler and executor interactions (from the paper)
- The method is best suited to scenarios where the generated sparse code can be reused repeatedly, given the generation and execution time overhead (from the paper)
- Extension to other hardware backends is non-trivial (from the paper)
- Replace sparse matrix indices with task-specific structural features and use multi-modal adapters (from the paper)
- Adapt the hierarchical reward to task-specific metrics such as loop execution time and parallelization speedup (from the paper)
- Reuse the pretrain-SFT-RL workflow with task-specific training data for general code optimization (from the paper)
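The hierarchical reward gates efficiency behind correctness: a candidate kernel earns an efficiency bonus only after it compiles and produces correct results. A minimal sketch follows; the stage structure mirrors the abstract's description, but the reward magnitudes and the speedup bonus term are illustrative assumptions, not the paper's exact values:

```python
def hierarchical_reward(compiles, passes_tests, runtime_s=None, baseline_s=None):
    """Staged reward: compilation, then correctness, then efficiency.

    compiles     : did CUDA compilation succeed?
    passes_tests : did the kernel produce numerically correct output?
    runtime_s    : measured runtime of the generated kernel (seconds)
    baseline_s   : runtime of a reference implementation (seconds)
    """
    if not compiles:
        return -1.0                      # compile failure: strong penalty
    if not passes_tests:
        return 0.0                       # compiles but wrong: no reward
    reward = 1.0                         # correct code earns the base reward
    if runtime_s and baseline_s:
        # Bonus proportional to speedup over the baseline, floored at zero
        reward += max(0.0, baseline_s / runtime_s - 1.0)
    return reward

print(hierarchical_reward(False, False))                               # -1.0
print(hierarchical_reward(True, True, runtime_s=0.5, baseline_s=1.0))  # 2.0
```

This gating is also why RL fine-tuning is expensive, as noted in the limitations: every reward evaluation beyond the compile check requires actually executing the candidate kernel.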
Author keywords
- Reinforcement Learning
- CUDA Code Generation
- High-Performance Computing