ICLR 2026 Orals

Taming Momentum: Rethinking Optimizer States Through Low-Rank Approximation

Zhengbo Wang, Jian Liang, Ran He, Zilei Wang, Tieniu Tan

LLMs & Reasoning · Fri, Apr 24 · 10:30 AM–10:40 AM · 202 A/B · Avg rating: 6.00 (6–6)

Abstract

Modern optimizers like Adam and Muon are central to training large language models, but their reliance on first- and second-order momenta introduces significant memory overhead, which constrains scalability and computational efficiency. In this work, we reframe the exponential moving average (EMA) used in these momenta as the training of a linear regressor via online gradient flow. Building on this equivalence, we introduce LoRA-Pre, a novel low-rank optimizer designed for efficient pre-training. Specifically, LoRA-Pre reduces the optimizer's memory footprint by decomposing the full momentum matrix into a compact low-rank subspace within the online linear learner, thereby maintaining optimization performance while improving memory efficiency. We empirically validate LoRA-Pre's efficacy by pre-training models from the Llama architecture family, scaling from 60M to 1B parameters. LoRA-Pre achieves the highest performance across all model sizes. Notably, LoRA-Pre demonstrates remarkable rank efficiency, achieving comparable or superior results using only 1/8 the rank of baseline methods. Beyond pre-training, we evaluate LoRA-Pre's effectiveness in fine-tuning scenarios. With the same rank, LoRA-Pre consistently outperforms all efficient fine-tuning baselines. Specifically, compared to standard LoRA, LoRA-Pre achieves substantial improvements of 3.14 points on Llama-3.1-8B and 6.17 points on Llama-2-7B, validating our approach's effectiveness across both pre-training and fine-tuning paradigms. Our code is publicly available at https://github.com/mrflogs/LoRA-Pre.
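To make the stated EMA–regression equivalence concrete, here is a one-step derivation under an assumed squared-error tracking loss; the discrete step size η and the loss form are illustrative choices, not necessarily the paper's exact formulation (the abstract phrases the equivalence in terms of online gradient flow, i.e., the continuous-time limit).

```latex
% EMA as one online gradient step on an assumed squared-error tracking loss.
\ell_t(m) \;=\; \tfrac{1}{2}\,\lVert m - g_t \rVert^2
\quad\Longrightarrow\quad
m_t \;=\; m_{t-1} - \eta\,\nabla \ell_t(m_{t-1})
      \;=\; (1-\eta)\,m_{t-1} + \eta\, g_t .
```

Setting η = 1 − β gives exactly the first-order momentum update m_t = β m_{t−1} + (1 − β) g_t; the same argument with the elementwise square g_t ⊙ g_t as the regression target recovers Adam's second moment.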

One-sentence summary · Auto-generated by claude-haiku-4-5-20251001

LoRA-Pre is a low-rank optimizer that reduces momentum memory by decomposing the momentum matrices of an online linear learner, while maintaining optimization performance.

Contributions · Auto-generated by claude-haiku-4-5-20251001
  • Reframing the EMA used in momentum as training a linear regressor via online gradient flow
  • Low-rank decomposition of the momentum matrices that preserves the EMA form in the compressed space (see the sketch after this list)
  • LoRA-PreAdam and LoRA-PreMuon variants with strong rank efficiency, matching baselines at 1/8 of their rank
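As referenced above, a minimal NumPy sketch of how a rank-r momentum state could be maintained and applied. The fixed orthonormal projection P and the project → EMA → project-back update are illustrative assumptions for a generic low-rank momentum scheme, not the paper's actual LoRA-Pre decomposition.

```python
import numpy as np

# Hypothetical sketch: keep first-order momentum only in an r-dimensional
# subspace of the gradient's row space. P is a fixed orthonormal basis here;
# the paper's decomposition inside the online linear learner may differ.

rng = np.random.default_rng(0)
m_dim, n_dim, r = 256, 128, 16          # full weight shape vs. low-rank dimension
beta, lr = 0.9, 1e-2

W = rng.standard_normal((m_dim, n_dim)) * 0.02
P, _ = np.linalg.qr(rng.standard_normal((m_dim, r)))   # orthonormal projection (assumption)
M = np.zeros((r, n_dim))                # momentum stored only in the r-dim subspace

def step(W, M, grad):
    """One update: EMA of the projected gradient, applied back in the full space."""
    g_low = P.T @ grad                   # compress the gradient to rank r
    M = beta * M + (1.0 - beta) * g_low  # the usual EMA, but on the compressed state
    W = W - lr * (P @ M)                 # map the low-rank momentum back to the weights
    return W, M

for _ in range(5):                       # toy loop with random "gradients"
    grad = rng.standard_normal((m_dim, n_dim))
    W, M = step(W, M, grad)

print("momentum state size:", M.size, "vs. full momentum:", m_dim * n_dim)
```

The memory saving comes from storing an r × n state instead of an m × n one, which is the source of the rank-efficiency claims in the abstract.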
Methods used · Auto-generated by claude-haiku-4-5-20251001
  • low-rank factorization
  • online linear regression
  • EMA decomposition
Limitations (author-stated) · Auto-generated by claude-haiku-4-5-20251001

Authors did not state explicit limitations.

Future work (author-stated) · Auto-generated by claude-haiku-4-5-20251001

Authors did not state explicit future directions.

Author keywords

  • Large Language Models; Efficient Training; Low-Rank; LoRA
