How Learning Rate Decay Wastes Your Best Data in Curriculum-Based LLM Pretraining

Kairong Luo, Zhenbo Sun, Haodong Wen, Xinyu Shi, Jiarui Cui, Chenyi Dang, Kaifeng Lyu, Wenguang Chen

LLMs & Reasoning · Fri, Apr 24 · 11:06 AM–11:16 AM · 202 A/B · Avg rating: 6.00 (6–6)

Author-provided TL;DR

Use model weight averaging to enhance curriculum learning in LLM pretraining.

Abstract

Due to the scarcity of high-quality data, large language models (LLMs) are often trained on mixtures of data with varying quality levels, even after sophisticated data curation. A natural approach to better leverage high-quality data is curriculum-based pretraining, where the model is trained on data sorted in ascending order of quality as determined by a quality metric. However, prior studies have reported limited improvements from such curriculum-based pretraining strategies. This work identifies a critical factor constraining these methods: the incompatibility between the ascending data quality order and the decaying learning rate (LR) schedule. We find that while curriculum-based training substantially outperforms random shuffling when using a constant LR, its advantage diminishes under standard LR decay schedules. Our experiments show this incompatibility can be mitigated by two simple strategies: (1) employing a more moderate LR decay schedule, where the final LR is only moderately smaller than the peak LR, and (2) replacing LR decay with model averaging, i.e., computing a weighted average of the final few checkpoints. By combining these strategies, we improve the average score on a suite of standard benchmarks by 1.64% over random shuffling, without additional data refinement. Validated on 1.5B-parameter models trained over 30B tokens with various data-quality metrics, our findings call for a re-evaluation of curriculum-based LLM pretraining and underscore the potential of co-designing data curricula with optimization methods.
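The first strategy in the abstract, a more moderate LR decay, can be sketched as below. This is a minimal illustration and not the authors' implementation: the cosine shape, the function name `cosine_lr`, the `final_lr_ratio` parameter, and every numeric value are assumptions chosen for the example. The abstract only states that the final LR should be moderately smaller than the peak LR.

```python
import math


def cosine_lr(step, total_steps, peak_lr, final_lr_ratio, warmup_steps=0):
    """Linear warmup, then cosine decay from peak_lr to final_lr_ratio * peak_lr.

    final_lr_ratio is a hypothetical knob: a value near 0 gives an aggressive
    decay, while a larger value keeps the LR closer to its peak at the end of
    training, when an ascending-quality curriculum serves its best data.
    """
    if warmup_steps > 0 and step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(max(progress, 0.0), 1.0)
    final_lr = peak_lr * final_lr_ratio
    return final_lr + 0.5 * (peak_lr - final_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    total = 1000  # illustrative step count, not taken from the paper
    for name, ratio in [("aggressive", 0.0), ("moderate", 0.3)]:
        last = cosine_lr(total, total, peak_lr=3e-4, final_lr_ratio=ratio)
        print(f"{name} decay: final LR = {last:.2e}")
```

With an aggressive schedule, the highest-quality data at the end of the curriculum is consumed at a near-zero LR; raising the floor is one way to let those final batches still move the weights.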

One-sentence summary · Auto-generated by claude-haiku-4-5-20251001

The study identifies an incompatibility between ascending-quality data curricula and decaying learning-rate schedules in LLM pretraining, and proposes moderate LR decay and model averaging as mitigations.

Contributions · Auto-generated by claude-haiku-4-5-20251001

Identifying a critical incompatibility between the ascending data-quality order of curriculum learning and standard LR decay schedules
Proposing two mitigation strategies: a moderate LR decay schedule whose final LR is only moderately smaller than the peak LR, and replacing LR decay with a weighted average of the final few checkpoints
Showing that combining both strategies improves the average benchmark score by 1.64% over random shuffling on 1.5B-parameter models trained over 30B tokens
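The model-averaging contribution listed above can be illustrated with a small sketch. It assumes checkpoints are plain dictionaries mapping parameter names to flat lists of floats, and the helper name `average_checkpoints` and its weights are hypothetical. The abstract specifies only "a weighted average of the final few checkpoints", so the number of checkpoints and the weighting scheme here are placeholders.

```python
def average_checkpoints(checkpoints, weights=None):
    """Return the weighted average of several checkpoints.

    checkpoints: list of dicts {param_name: list of floats}, all same shapes.
    weights: optional non-negative weights, one per checkpoint; they are
             normalized to sum to 1. Defaults to a uniform average.
    """
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    if weights is None:
        weights = [1.0] * len(checkpoints)
    if len(weights) != len(checkpoints):
        raise ValueError("one weight per checkpoint is required")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    norm = [w / total for w in weights]

    averaged = {}
    for name, first in checkpoints[0].items():
        averaged[name] = [
            sum(w * ckpt[name][i] for w, ckpt in zip(norm, checkpoints))
            for i in range(len(first))
        ]
    return averaged


if __name__ == "__main__":
    # Toy stand-ins for the last few checkpoints of a training run.
    last_ckpts = [{"w": [0.0, 2.0]}, {"w": [2.0, 4.0]}]
    print(average_checkpoints(last_ckpts))
    print(average_checkpoints(last_ckpts, weights=[1.0, 3.0]))
```

In a real training stack the same loop would run over tensors rather than lists; the point of the sketch is only that averaging happens in weight space after training, so the LR can stay high while the final, highest-quality data is consumed.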

Methods used · Auto-generated by claude-haiku-4-5-20251001

curriculum learning
learning rate scheduling
model averaging

Datasets used · Auto-generated by claude-haiku-4-5-20251001

MMLU
ARC-c
ARC-e
CSQA
OBQA
PIQA
SIQA
Winogrande

Limitations (author-stated) · Auto-generated by claude-haiku-4-5-20251001

Authors did not state explicit limitations.

Future work (author-stated) · Auto-generated by claude-haiku-4-5-20251001

Authors did not state explicit future directions.

Author keywords

LLM pretraining
Curriculum Learning
Model Weight Average

Related orals

Benchmarking Empirical Privacy Protection for Adaptations of Large Language Models

Half-order Fine-Tuning for Diffusion Model: A Recursive Likelihood Ratio Optimizer

Invisible Safety Threat: Malicious Finetuning for LLM via Steganography

Reducing Belief Deviation in Reinforcement Learning for Active Reasoning of LLM Agents

RefineStat: Efficient Exploration for Probabilistic Program Synthesis