ICLR 2026 Orals

How Learning Rate Decay Wastes Your Best Data in Curriculum-Based LLM Pretraining

Kairong Luo, Zhenbo Sun, Haodong Wen, Xinyu Shi, Jiarui Cui, Chenyi Dang, Kaifeng Lyu, Wenguang Chen

LLMs & Reasoning · Fri, Apr 24 · 11:06 AM–11:16 AM · 202 A/B · Avg rating: 6.00 (6–6)
Author-provided TL;DR

Use model weight averaging to enhance curriculum learning in LLM pretraining.

Abstract

Due to the scarcity of high-quality data, large language models (LLMs) are often trained on mixtures of data with varying quality levels, even after sophisticated data curation. A natural approach to better leverage high-quality data is curriculum-based pretraining, where the model is trained on data sorted in ascending order of quality as determined by a quality metric. However, prior studies have reported limited improvements from such curriculum-based pretraining strategies. This work identifies a critical factor constraining these methods: the incompatibility between the ascending data quality order and the decaying learning rate (LR) schedule. We find that while curriculum-based training substantially outperforms random shuffling when using a constant LR, its advantage diminishes under standard LR decay schedules. Our experiments show this incompatibility can be mitigated by two simple strategies: (1) employing a more moderate LR decay schedule, where the final LR is only moderately smaller than the peak LR, and (2) replacing LR decay with model averaging, i.e., computing a weighted average of the final few checkpoints. By combining these strategies, we improve the average score on a suite of standard benchmarks by 1.64% over random shuffling, without additional data refinement. Validated on 1.5B-parameter models trained over 30B tokens with various data-quality metrics, our findings call for a re-evaluation of curriculum-based LLM pretraining and underscore the potential of co-designing data curricula with optimization methods.
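The first strategy in the abstract, a more moderate LR decay paired with an ascending-quality data order, can be illustrated with a minimal sketch. The abstract does not specify the schedule shape or the final-to-peak ratio, so the cosine form, the `final_ratio` parameter, and the function names below are assumptions chosen for illustration, not the authors' implementation.

```python
import math


def cosine_lr(step, total_steps, peak_lr, final_ratio, warmup_steps=0):
    """Learning rate at `step`: linear warmup, then cosine decay.

    The schedule decays from `peak_lr` to `final_ratio * peak_lr`.
    A `final_ratio` near 0 mimics an aggressive decay; a larger value
    (hypothetical example: 0.5) keeps the final LR only moderately
    below the peak.
    """
    if warmup_steps > 0 and step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    final_lr = final_ratio * peak_lr
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return final_lr + (peak_lr - final_lr) * cosine


def curriculum_order(quality_scores):
    """Return document indices sorted by ascending quality score.

    Training in this order places the highest-quality documents at the
    end of the run, which is where a decaying schedule has the
    smallest LR.
    """
    return sorted(range(len(quality_scores)), key=lambda i: quality_scores[i])


if __name__ == "__main__":
    total = 1000
    for ratio in (0.0, 0.5):
        last = cosine_lr(total, total, peak_lr=1e-3, final_ratio=ratio)
        print(f"final_ratio={ratio}: LR at last step = {last:.2e}")
    print(curriculum_order([0.9, 0.1, 0.5]))
```

With `final_ratio=0.0` the LR approaches zero exactly when the curriculum reaches its best data, which is the incompatibility the abstract describes; a larger ratio leaves those late, high-quality batches a meaningful step size.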

One-sentence summary · Auto-generated by claude-haiku-4-5-20251001

The study identifies an incompatibility between ascending-quality data curricula and decaying learning rate schedules in LLM pretraining, and proposes a more moderate decay and model averaging as mitigations.

Contributions · Auto-generated by claude-haiku-4-5-20251001
  • Identifies a critical incompatibility between the ascending-quality curriculum order and standard LR decay schedules
  • Proposes two mitigation strategies: a moderate LR decay in which the final LR is only moderately smaller than the peak LR, and replacing LR decay with model averaging (a weighted average of the final few checkpoints)
  • Combining both strategies improves the average benchmark score by 1.64% over random shuffling, validated on 1.5B-parameter models trained over 30B tokens
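The model averaging strategy, described in the abstract as a weighted average of the final few checkpoints, can be sketched as follows. The source does not state the weighting scheme or the number of checkpoints, so the uniform default, the dict-of-lists checkpoint layout, and the name `average_checkpoints` are assumptions for illustration only.

```python
def average_checkpoints(checkpoints, weights=None):
    """Weighted average of parameter values across checkpoints.

    `checkpoints` is a list of dicts mapping a parameter name to a flat
    list of floats (a stand-in for real tensors). `weights` defaults to
    uniform and is normalized to sum to 1.
    """
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    if weights is None:
        weights = [1.0] * len(checkpoints)
    if len(weights) != len(checkpoints):
        raise ValueError("one weight per checkpoint is required")
    total = sum(weights)
    norm = [w / total for w in weights]

    averaged = {}
    for name, first_values in checkpoints[0].items():
        averaged[name] = [
            sum(w * ckpt[name][i] for w, ckpt in zip(norm, checkpoints))
            for i in range(len(first_values))
        ]
    return averaged


if __name__ == "__main__":
    # Two toy "checkpoints" saved near the end of a constant-LR run.
    last_two = [{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}]
    print(average_checkpoints(last_two))              # uniform weights
    print(average_checkpoints(last_two, [1.0, 3.0]))  # favor the later one
```

Averaging takes over the noise-reduction role that LR decay would otherwise play, so the LR can stay high while the curriculum delivers its highest-quality data.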
Methods used · Auto-generated by claude-haiku-4-5-20251001
  • curriculum learning
  • learning rate scheduling
  • model averaging
Datasets used · Auto-generated by claude-haiku-4-5-20251001
  • MMLU
  • ARC-c
  • ARC-e
  • CSQA
  • OBQA
  • PIQA
  • SIQA
  • Winogrande
Limitations (author-stated) · Auto-generated by claude-haiku-4-5-20251001

Authors did not state explicit limitations.

Future work (author-stated) · Auto-generated by claude-haiku-4-5-20251001

Authors did not state explicit future directions.

Author keywords

  • LLM pretraining
  • Curriculum Learning
  • Model Weight Average
