ICLR 2026 Orals

ParaRNN: Unlocking Parallel Training of Nonlinear RNNs for Large Language Models

Federico Danieli, Pau Rodriguez, Miguel Sarabia, Xavier Suau, Luca Zappella

LLMs & Reasoning · Fri, Apr 24 · 3:51 PM–4:01 PM · Amphitheater · Avg rating: 6.50 (6–8)
Author-provided TL;DR

We break the sequential bottleneck of nonlinear RNNs, enabling training of billion-scale LSTM/GRU models that are competitive with modern architectures.

Abstract

Recurrent Neural Networks (RNNs) laid the foundation for sequence modeling, but their intrinsic sequential nature restricts parallel computation, creating a fundamental barrier to scaling. This has led to the dominance of parallelizable architectures like Transformers and, more recently, State Space Models (SSMs). While SSMs achieve efficient parallelization through structured linear recurrences, this linearity constraint limits their expressive power and precludes modeling complex, nonlinear sequence-wise dependencies. To address this, we present ParaRNN, a framework that breaks the sequence-parallelization barrier for nonlinear RNNs. Building on prior work, we cast the sequence of nonlinear recurrence relationships as a single system of equations, which we solve in parallel using Newton's iterations combined with custom parallel reductions. Our implementation achieves speedups of up to 665× over naïve sequential application, allowing training of nonlinear RNNs at unprecedented scales. To showcase this, we apply ParaRNN to adaptations of LSTM and GRU architectures, successfully training models of 7B parameters that attain perplexity comparable to similarly-sized Transformers and Mamba2 architectures. To accelerate research in efficient sequence modeling, we release the ParaRNN codebase as an open-source framework for automatic training-parallelization of nonlinear RNNs, enabling researchers and practitioners to explore new nonlinear RNN models at scale.
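
To make the formulation concrete, here is a worked sketch of the "single system of equations" the abstract alludes to, written for a generic recurrence h_t = f(h_{t-1}, x_t); the notation (F, H, A_t) is ours, not necessarily the paper's, and the snippet assumes amsmath.

```latex
% Stack the T nonlinear recurrences h_t = f(h_{t-1}, x_t) into one residual system.
\[
F(H) =
\begin{bmatrix}
h_1 - f(h_0, x_1)\\
h_2 - f(h_1, x_2)\\
\vdots\\
h_T - f(h_{T-1}, x_T)
\end{bmatrix}
= 0,
\qquad
J(H)\,\Delta H = -F(H).
\]
% The Jacobian J is block lower-bidiagonal (J_{t,t} = I, J_{t,t-1} = -\partial f/\partial h_{t-1}),
% so each Newton step reduces to a *linear* recurrence:
\[
\Delta h_t = A_t\,\Delta h_{t-1} - F_t(H),
\qquad
A_t = \frac{\partial f}{\partial h_{t-1}}(h_{t-1}, x_t),
\qquad
\Delta h_0 = 0.
\]
```

Since Δh_0 = 0, a parallel scan over the pairs (A_t, -F_t) yields the full Newton update in logarithmic parallel depth, which is what lets the nonlinear recurrence be trained with the same sequence-parallel machinery as linear SSMs.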

One-sentence summary · Auto-generated by claude-haiku-4-5-20251001

Enables parallel training of nonlinear RNNs via Newton's method, achieving a 665× speedup over sequential application.

Contributions · Auto-generated by claude-haiku-4-5-20251001
  • Introduces the ParaRNN framework, breaking the sequence-parallelization barrier for nonlinear RNNs
  • Casts the nonlinear recurrence as a single system of equations, solved in parallel with Newton iterations and custom parallel reductions (see the sketch after this list)
  • Achieves speedups of up to 665× over naive sequential application
  • Demonstrates that 7B-parameter nonlinear RNNs achieve perplexity comparable to Transformers and Mamba2
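
A minimal, illustrative sketch of that recipe, assuming an elementwise cell with a diagonal Jacobian (the structure constraint echoed under limitations) and using JAX's associative_scan as a stand-in for the paper's custom parallel reductions; names such as parallel_newton are ours, and this is not the released ParaRNN code.

```python
import jax
import jax.numpy as jnp

# Toy elementwise cell h_t = f(h_{t-1}, x_t); a GRU/LSTM-style cell would
# replace this in practice. Elementwise => diagonal Jacobian w.r.t. h_{t-1}.
def f(h_prev, x):
    return jnp.tanh(0.5 * h_prev + x)

def combine(a, b):
    # Associative composition for the linear recurrence dh_t = A_t*dh_{t-1} + b_t.
    A1, b1 = a
    A2, b2 = b
    return A2 * A1, A2 * b1 + b2

def parallel_newton(x, h0, num_iters=8):
    """Solve h_t = f(h_{t-1}, x_t) for all t jointly via Newton + parallel scan."""
    T, d = x.shape
    h = jnp.zeros((T, d))  # initial guess for the whole trajectory
    # Diagonal of df/dh_{t-1} per timestep (valid because f is elementwise).
    df_dh = jax.vmap(jax.grad(lambda hp, xt: f(hp, xt).sum(), argnums=0))
    for _ in range(num_iters):
        h_prev = jnp.concatenate([h0[None, :], h[:-1]], axis=0)
        F = h - f(h_prev, x)       # residual of the stacked system F(H)
        A = df_dh(h_prev, x)       # A_t = df/dh_{t-1}, shape (T, d)
        # Newton step: the block-bidiagonal solve J*dh = -F is the linear
        # recurrence dh_t = A_t*dh_{t-1} - F_t with dh_0 = 0 -> parallel scan.
        _, dh = jax.lax.associative_scan(combine, (A, -F), axis=0)
        h = h + dh
    return h

# Reference: plain sequential evaluation of the recurrence.
def sequential(x, h0):
    def step(h, xt):
        h_new = f(h, xt)
        return h_new, h_new
    _, hs = jax.lax.scan(step, h0, x)
    return hs

if __name__ == "__main__":
    key = jax.random.PRNGKey(0)
    x = jax.random.normal(key, (1024, 16))
    h0 = jnp.zeros(16)
    assert jnp.allclose(parallel_newton(x, h0), sequential(x, h0), atol=1e-4)
```

Swapping the toy tanh cell for a gated LSTM/GRU-style cell is where the paper's structured-Jacobian machinery would come in; the Newton loop and the scan itself stay the same.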
Methods used · Auto-generated by claude-haiku-4-5-20251001
  • Parallel training
  • Newton's method
  • LSTM
  • GRU
Limitations (author-stated) · Auto-generated by claude-haiku-4-5-20251001
  • Relies on convergence of Newton's iterations; practical convergence in O(1) iterations is needed for feasibility (from the paper)
  • Introduces additional overhead computations not needed in the sequential case, particularly for dense Jacobians (from the paper)
  • Jacobian structure constraints are needed for computational tractability, similar to Mamba's diagonal assumption (from the paper)
Future work (author-stated) · Auto-generated by claude-haiku-4-5-20251001

Authors did not state explicit future directions.

Author keywords

  • RNN
  • Mamba
  • SSM
  • Transformers
  • Parallelization
  • Parallel scan
  • Nonlinear
