ParaRNN: Unlocking Parallel Training of Nonlinear RNNs for Large Language Models
Federico Danieli, Pau Rodriguez, Miguel Sarabia, Xavier Suau, Luca Zappella
We break the sequential bottleneck of nonlinear RNNs, enabling training of billion-scale LSTM/GRU models that are competitive with modern architectures
Abstract
Recurrent Neural Networks (RNNs) laid the foundation for sequence modeling, but their intrinsic sequential nature restricts parallel computation, creating a fundamental barrier to scaling. This has led to the dominance of parallelizable architectures like Transformers and, more recently, State Space Models (SSMs). While SSMs achieve efficient parallelization through structured linear recurrences, this linearity constraint limits their expressive power and precludes modeling complex, nonlinear sequence-wise dependencies. To address this, we present ParaRNN, a framework that breaks the sequence-parallelization barrier for nonlinear RNNs. Building on prior work, we cast the sequence of nonlinear recurrence relationships as a single system of equations, which we solve in parallel using Newton's iterations combined with custom parallel reductions. Our implementation achieves speedups of up to 665× over naive sequential application, allowing training of nonlinear RNNs at unprecedented scales. To showcase this, we apply ParaRNN to adaptations of LSTM and GRU architectures, successfully training models of 7B parameters that attain perplexity comparable to similarly-sized Transformer and Mamba2 architectures. To accelerate research in efficient sequence modeling, we release the ParaRNN codebase as an open-source framework for automatic training-parallelization of nonlinear RNNs, enabling researchers and practitioners to explore new nonlinear RNN models at scale.
Enables parallel training of nonlinear RNNs via Newton's method, achieving a 665× speedup over naive sequential application.
- Introduces ParaRNN framework breaking sequence-parallelization barrier for nonlinear RNNs
- Casts nonlinear recurrence as system solved in parallel using Newton iterations with custom reductions
- Achieves speedups up to 665x over naive sequential application
- Demonstrates that 7B-parameter nonlinear RNNs attain perplexity comparable to Transformers and Mamba2
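The core idea behind the bullets above can be illustrated on a toy problem. The sketch below is not the authors' implementation (ParaRNN ships custom CUDA kernels and handles vector states with structured Jacobians); it assumes a hypothetical scalar recurrence h_t = tanh(a·h_{t-1} + x_t) and illustrative function names. The whole trajectory is treated as one system of equations F(h) = 0 and solved with Newton's method; because each Newton step linearizes the recurrence, the update itself becomes a *linear* recurrence, which is solved with a log-depth associative scan — the part that parallelizes across the sequence.

```python
import numpy as np


def newton_parallel_solve(x, h0=0.0, a=0.5, iters=25, tol=1e-12):
    """Solve h_t = tanh(a*h_{t-1} + x_t) for all t simultaneously.

    Stacks the recurrence into one system F(h) = 0 and applies Newton's
    method.  Each Newton step reduces to the linear recurrence
        dh_t = g_t * dh_{t-1} + b_t,
    solved here with a Hillis-Steele-style scan (O(log T) combine rounds,
    each fully parallel across timesteps).
    """
    x = np.asarray(x, dtype=float)
    T = len(x)
    h = np.zeros(T)  # initial guess for the whole trajectory
    for _ in range(iters):
        h_prev = np.concatenate(([h0], h[:-1]))
        pre = a * h_prev + x
        r = h - np.tanh(pre)              # residual F(h), one entry per step
        g = a * (1.0 - np.tanh(pre) ** 2)  # -dF_t/dh_{t-1} (Jacobian off-diag)
        b = -r
        # Inclusive scan composing the affine maps v -> g_t*v + b_t.
        G, B = g.copy(), b.copy()
        shift = 1
        while shift < T:
            B2 = B.copy()
            G2 = G.copy()
            B2[shift:] = B[shift:] + G[shift:] * B[:-shift]
            G2[shift:] = G[shift:] * G[:-shift]
            G, B = G2, B2
            shift *= 2
        dh = B  # dh_t with dh_{-1} = 0
        h = h + dh
        if np.max(np.abs(dh)) < tol:
            break
    return h
```

Because the stacked system is lower triangular in time, each Newton iteration fixes at least one more timestep exactly, and in practice convergence takes only a handful of iterations — the O(1)-iterations condition the paper identifies as the key feasibility requirement. ParaRNN generalizes this pattern to full LSTM/GRU cells by constraining the Jacobian structure so the scan's combine step stays cheap.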
- Parallel training
- Newton's method
- LSTM
- GRU
Relies on the convergence of Newton's iterations; practical feasibility requires convergence in O(1) iterations (from the paper)
Introduces additional overhead computations not needed in the sequential case, particularly for dense Jacobians (from the paper)
Requires constraints on the Jacobian structure for computational tractability, similar to Mamba's diagonal assumption (from the paper)
Authors did not state explicit future directions.
Author keywords
- RNN
- Mamba
- SSM
- Transformers
- Parallelization
- Parallel scan
- Nonlinear
Related orals
Benchmarking Empirical Privacy Protection for Adaptations of Large Language Models
Benchmarks practical privacy risks in differential-privacy-adapted LLMs, revealing that distribution shifts and model choice impact protection effectiveness.
Half-order Fine-Tuning for Diffusion Model: A Recursive Likelihood Ratio Optimizer
Proposes Recursive Likelihood Ratio optimizer for efficient fine-tuning of diffusion models with lower variance gradient estimation.
Invisible Safety Threat: Malicious Finetuning for LLM via Steganography
Demonstrates LLMs can be finetuned to generate harmful steganographically-hidden outputs while appearing benign to safety systems.
Reducing Belief Deviation in Reinforcement Learning for Active Reasoning of LLM Agents
Proposes T3 algorithm to detect belief deviation in LLM agents and truncate trajectories for improved reinforcement learning in active reasoning tasks.
RefineStat: Efficient Exploration for Probabilistic Program Synthesis
RefineStat enforces semantic constraints and applies diagnostic-aware refinement for synthesizing valid probabilistic programs from smaller language models.