ICLR 2026 Orals

WSM: Decay-Free Learning Rate Schedule via Checkpoint Merging for LLM Pre-training

Changxin Tian, Jiapeng Wang, Qian Zhao, Kunlong Chen, Jia Liu, Ziqi Liu, Jiaxin Mao, Xin Zhao, Zhiqiang Zhang, Jun Zhou

LLMs & Reasoning · Fri, Apr 24 · 10:42 AM–10:52 AM · 202 A/B · Avg rating: 7.00 (2–10)

Abstract

Recent advances in learning rate (LR) scheduling have demonstrated the effectiveness of decay-free approaches that eliminate the traditional decay phase while maintaining competitive performance. Model merging techniques have emerged as particularly promising solutions in this domain. We present Warmup-Stable and Merge (WSM), a general framework that establishes a formal connection between learning rate decay and model merging. WSM provides a unified theoretical foundation for emulating various decay strategies, including cosine decay, linear decay, and inverse square root decay, as principled model averaging schemes, while remaining fully compatible with diverse optimization methods. Through extensive experiments, we identify merge duration (the training window for checkpoint aggregation) as the most critical factor influencing model performance, surpassing the importance of both checkpoint interval and merge quantity. With high-quality annealing data, our framework consistently outperforms the widely adopted Warmup-Stable-Decay (WSD) approach across multiple benchmarks, achieving significant improvements of +3.5% on MATH, +2.9% on HumanEval, and +5.5% on MMLU-Pro. The performance advantages extend to supervised fine-tuning scenarios, highlighting WSM's potential for long-term model refinement.
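Below is a minimal sketch of the recipe the abstract describes, assuming checkpoints saved during the constant-LR (stable) phase and plain-SGD intuition; the function names, the exact weight formulas, and the inverse-square-root scaling are illustrative assumptions, not the authors' released implementation.

import math
import torch

# Target decay shapes f: [0, 1] -> [0, 1], decreasing, with f(0) = 1.
DECAY_SHAPES = {
    "cosine":   lambda u: 0.5 * (1.0 + math.cos(math.pi * u)),
    "linear":   lambda u: 1.0 - u,
    "inv_sqrt": lambda u: 1.0 / math.sqrt(1.0 + 9.0 * u),  # scaled so f(0) = 1; assumed
}

def merge_weights(n_ckpts: int, scheme: str = "cosine") -> list[float]:
    """Per-checkpoint weights whose tail sums trace the chosen decay shape.

    Under plain SGD at a constant LR, merging checkpoints theta_0..theta_n
    with weights w_t applies an effective LR of lr * sum_{t >= k} w_t to the
    k-th gradient, so making those tail sums follow f emulates decaying by f.
    """
    if n_ckpts <= 1:
        return [1.0]
    f = DECAY_SHAPES[scheme]
    n = n_ckpts - 1
    # c[k] is the desired tail sum after step k; weights are its differences.
    c = [f(k / n) for k in range(n + 1)]
    w = [c[t] - c[t + 1] for t in range(n)] + [c[n]]
    total = sum(w)  # telescopes to c[0] = 1; kept as a safety net
    return [x / total for x in w]

def merge_checkpoints(
    state_dicts: list[dict[str, torch.Tensor]], scheme: str = "cosine"
) -> dict[str, torch.Tensor]:
    """Weighted average of parameter tensors across saved checkpoints."""
    weights = merge_weights(len(state_dicts), scheme)
    return {
        name: sum(w * sd[name].float() for w, sd in zip(weights, state_dicts))
        for name in state_dicts[0]
    }

A caller would load the saved state dicts (e.g., via torch.load) and write the merged result back with load_state_dict; no separate decay phase is ever run, which is what makes the schedule decay-free.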

One-sentence summary · Auto-generated by claude-haiku-4-5-20251001

WSM establishes a theoretical connection between LR decay and model merging for improved LLM pre-training.

Contributions · Auto-generated by claude-haiku-4-5-20251001
  • Formal connection between learning rate decay and model merging through checkpoint averaging (sketched after this list)
  • Unified theoretical foundation emulating various decay strategies as principled model averaging schemes
  • Identifies merge duration as most critical factor influencing performance
  • Demonstrates +3.5% MATH, +2.9% HumanEval, +5.5% MMLU-Pro improvements over WSD baseline
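The first contribution above concerns a formal link between decay and merging. A sketch of the kind of identity involved, under the simplifying assumption of plain SGD at a constant learning rate $\eta$ (our notation, not necessarily the paper's): starting from $\theta_0$, constant-LR SGD gives $\theta_t = \theta_0 - \eta \sum_{k=1}^{t} g_k$, so a merged model with weights $w_t$, $\sum_t w_t = 1$, satisfies

    \bar{\theta} = \sum_{t=0}^{n} w_t \theta_t = \theta_0 - \eta \sum_{k=1}^{n} \Big( \sum_{t=k}^{n} w_t \Big) g_k .

Each gradient $g_k$ thus receives an effective learning rate $\eta \sum_{t \ge k} w_t$, which is non-increasing in $k$; choosing the merge weights so these tail sums trace a cosine, linear, or inverse-square-root shape emulates that decay schedule without ever lowering the actual LR.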
Methods used · Auto-generated by claude-haiku-4-5-20251001
  • Checkpoint merging
  • Learning rate scheduling
  • Model averaging
  • Offline/online annealing
Datasets used · Auto-generated by claude-haiku-4-5-20251001
  • MATH
  • HumanEval
  • MMLU-Pro
Limitations (author-stated) · Auto-generated by claude-haiku-4-5-20251001

Authors did not state explicit limitations.

Future work (author-stated) · Auto-generated by claude-haiku-4-5-20251001

Authors did not state explicit future directions.

Author keywords

  • llm pre-training
  • learning rate schedule
  • checkpoint merging
  • decay-free approach
