ParaRNN: Unlocking Parallel Training of Nonlinear RNNs for Large Language Models
Federico Danieli, Pau Rodriguez, Miguel Sarabia, Xavier Suau, Luca Zappella
We break the sequential bottleneck of nonlinear RNNs, enabling training of billion-scale LSTM/GRU models that are competitive with modern architectures
Abstract
Recurrent Neural Networks (RNNs) laid the foundation for sequence modeling, but their intrinsic sequential nature restricts parallel computation, creating a fundamental barrier to scaling. This has led to the dominance of parallelizable architectures like Transformers and, more recently, State Space Models (SSMs). While SSMs achieve efficient parallelization through structured linear recurrences, this linearity constraint limits their expressive power and precludes modeling complex, nonlinear sequence-wise dependencies. To address this, we present ParaRNN, a framework that breaks the sequence-parallelization barrier for nonlinear RNNs. Building on prior work, we cast the sequence of nonlinear recurrence relationships as a single system of equations, which we solve in parallel using Newton's iterations combined with custom parallel reductions. Our implementation achieves speedups of up to 665× over naive sequential application, allowing training of nonlinear RNNs at unprecedented scales. To showcase this, we apply ParaRNN to adaptations of LSTM and GRU architectures, successfully training models of 7B parameters that attain perplexity comparable to similarly-sized Transformer and Mamba2 architectures. To accelerate research in efficient sequence modeling, we release the ParaRNN codebase as an open-source framework for automatic training-parallelization of nonlinear RNNs, enabling researchers and practitioners to explore new nonlinear RNN models at scale.
Enables parallel training of nonlinear RNNs via Newton's method, achieving a 665× speedup over naive sequential application.
- Introduces ParaRNN framework breaking sequence-parallelization barrier for nonlinear RNNs
- Casts nonlinear recurrence as system solved in parallel using Newton iterations with custom reductions
- Achieves speedups up to 665x over naive sequential application
- Demonstrates that 7B-parameter nonlinear RNNs attain perplexity comparable to Transformers and Mamba2
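The core idea behind the bullets above can be illustrated on a toy problem. The sketch below is not the authors' implementation (ParaRNN ships custom CUDA kernels and handles vector states with structured Jacobians); it assumes a hypothetical scalar recurrence h_t = tanh(a·h_{t-1} + x_t) and illustrative function names. The whole trajectory is treated as one system of equations F(h) = 0 and solved with Newton's method; because each Newton step linearizes the recurrence, the update itself becomes a *linear* recurrence, which is solved with a log-depth associative scan — the part that parallelizes across the sequence.

```python
import numpy as np


def newton_parallel_solve(x, h0=0.0, a=0.5, iters=25, tol=1e-12):
    """Solve h_t = tanh(a*h_{t-1} + x_t) for all t simultaneously.

    Stacks the recurrence into one system F(h) = 0 and applies Newton's
    method.  Each Newton step reduces to the linear recurrence
        dh_t = g_t * dh_{t-1} + b_t,
    solved here with a Hillis-Steele-style scan (O(log T) combine rounds,
    each fully parallel across timesteps).
    """
    x = np.asarray(x, dtype=float)
    T = len(x)
    h = np.zeros(T)  # initial guess for the whole trajectory
    for _ in range(iters):
        h_prev = np.concatenate(([h0], h[:-1]))
        pre = a * h_prev + x
        r = h - np.tanh(pre)              # residual F(h), one entry per step
        g = a * (1.0 - np.tanh(pre) ** 2)  # -dF_t/dh_{t-1} (Jacobian off-diag)
        b = -r
        # Inclusive scan composing the affine maps v -> g_t*v + b_t.
        G, B = g.copy(), b.copy()
        shift = 1
        while shift < T:
            B2 = B.copy()
            G2 = G.copy()
            B2[shift:] = B[shift:] + G[shift:] * B[:-shift]
            G2[shift:] = G[shift:] * G[:-shift]
            G, B = G2, B2
            shift *= 2
        dh = B  # dh_t with dh_{-1} = 0
        h = h + dh
        if np.max(np.abs(dh)) < tol:
            break
    return h
```

Because the stacked system is lower triangular in time, each Newton iteration fixes at least one more timestep exactly, and in practice convergence takes only a handful of iterations — the O(1)-iterations condition the paper identifies as the key feasibility requirement. ParaRNN generalizes this pattern to full LSTM/GRU cells by constraining the Jacobian structure so the scan's combine step stays cheap.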
- Parallel training
- Newton's method
- LSTM
- GRU
Relies on the convergence of Newton's iterations; practical feasibility requires convergence in O(1) iterations (from the paper)
Introduces additional overhead computations not needed in the sequential case, particularly for dense Jacobians (from the paper)
Requires constraints on the Jacobian structure for computational tractability, similar to Mamba's diagonal assumption (from the paper)
Authors did not state explicit future directions.
Author keywords
- RNN
- Mamba
- SSM
- Transformers
- Parallelization
- Parallel scan
- Nonlinear
Related orals
Benchmarking Empirical Privacy Protection for Adaptations of Large Language Models
Benchmarks practical privacy risks in differential-privacy-adapted LLMs, revealing that distribution shifts and model choice impact protection effectiveness.
Half-order Fine-Tuning for Diffusion Model: A Recursive Likelihood Ratio Optimizer
Proposes Recursive Likelihood Ratio optimizer for efficient fine-tuning of diffusion models with lower variance gradient estimation.
Invisible Safety Threat: Malicious Finetuning for LLM via Steganography
Demonstrates LLMs can be finetuned to generate harmful steganographically-hidden outputs while appearing benign to safety systems.
Reducing Belief Deviation in Reinforcement Learning for Active Reasoning of LLM Agents
Proposes T3 algorithm to detect belief deviation in LLM agents and truncate trajectories for improved reinforcement learning in active reasoning tasks.
RefineStat: Efficient Exploration for Probabilistic Program Synthesis
RefineStat enforces semantic constraints and applies diagnostic-aware refinement for synthesizing valid probabilistic programs from smaller language models.