Pre-training under infinite compute

Konwoo Kim, Suhas Kotha, Percy Liang, Tatsunori Hashimoto

LLMs & Reasoning · Fri, Apr 24 · 11:42 AM–11:52 AM · 202 A/B · Avg rating: 7.50 (6–8)

Author-provided TL;DR

Since compute grows faster than the web, we design simple recipes that improve the asymptote of compute scaling laws and are 5x more data-efficient, offering better performance given sufficient compute.

Abstract

Since compute grows much faster than web text available for language model pre-training, we ask how one should approach pre-training under fixed data and no compute constraints. We first show that existing data-constrained approaches of increasing epoch count and parameter count overfit, and we improve upon such recipes by tuning regularization, finding that the optimal weight decay is $30\times$ larger than standard practice. Since our regularized recipe monotonically decreases loss following a power law in parameter count, we estimate its best possible performance via the asymptote of its scaling law rather than the performance at a fixed compute budget. We then identify that ensembling independently trained models achieves a significantly lower loss asymptote than the regularized recipe. Our best intervention combining epoching, regularization, parameter scaling, and ensemble scaling achieves an asymptote at 200M tokens using $5.17\times$ less data than our baseline, and our data scaling laws predict that this improvement persists at higher token budgets. We find that our data efficiency gains can be realized at smaller parameter counts as we can distill an ensemble into a student model that is $8\times$ smaller and retains $83\%$ of the ensembling benefit. Finally, our interventions designed for validation loss generalize to downstream benchmarks, achieving a $9\%$ improvement for pre-training evals. Our results show that simple algorithmic improvements can enable significantly more data-efficient pre-training in a compute-rich future.
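
The idea of reading performance off the asymptote of a scaling law can be illustrated with a small sketch. It assumes the loss follows a saturating power law L(N) = E + A * N^(-alpha) in parameter count N, so that E is the loss approached as N grows without bound. The exact functional form and fitting procedure the authors use are not given on this page; the function name `fit_power_law`, the grid search over alpha, and the synthetic data points below are illustrative assumptions, not the authors' code or results.

```python
def fit_power_law(ns, losses, alphas=None):
    """Fit loss ~ E + A * n**(-alpha) and return (E, A, alpha).

    For a fixed alpha the model is linear in x = n**(-alpha), so E and A
    come from ordinary least squares; alpha is chosen by grid search.
    """
    if alphas is None:
        alphas = [i / 100 for i in range(5, 201)]  # 0.05 .. 2.00
    best = None
    for alpha in alphas:
        xs = [n ** (-alpha) for n in ns]
        m = len(xs)
        mean_x = sum(xs) / m
        mean_y = sum(losses) / m
        sxx = sum((x - mean_x) ** 2 for x in xs)
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, losses))
        a = sxy / sxx                # slope  -> A
        e = mean_y - a * mean_x      # offset -> E (the asymptote)
        sse = sum((e + a * x - y) ** 2 for x, y in zip(xs, losses))
        if best is None or sse < best[0]:
            best = (sse, e, a, alpha)
    _, e, a, alpha = best
    return e, a, alpha


# Synthetic, noise-free data (hypothetical values, parameter counts in millions).
true_e, true_a, true_alpha = 3.0, 2.0, 0.5
param_counts_m = [150, 300, 600, 1400]
synthetic_losses = [true_e + true_a * n ** (-true_alpha) for n in param_counts_m]

est_e, est_a, est_alpha = fit_power_law(param_counts_m, synthetic_losses)
print(f"estimated asymptote E = {est_e:.4f}, alpha = {est_alpha:.2f}")
```

Comparing recipes by their fitted E rather than by loss at one model size is what lets a comparison ignore the compute budget: a recipe with a lower E wins once enough compute is available, even if it is worse at small N.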

One-sentence summary · Auto-generated by claude-haiku-4-5-20251001

Shows that the optimal weight decay is 30x larger than standard practice and that ensembling achieves a lower loss asymptote, enabling data-efficient pre-training at scale.

Contributions · Auto-generated by claude-haiku-4-5-20251001

Demonstrates that the optimal weight decay is 30x larger than standard practice in data-constrained settings
Shows that ensemble scaling achieves a significantly lower loss asymptote than single-model training
Shows that distilling an ensemble into an 8x smaller student retains 83% of the ensembling benefit
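
As a rough illustration of why ensembling can lower loss, the sketch below averages the predicted next-token distributions of several members and compares the resulting negative log-likelihood with the members' average. Whether the authors average probabilities or logits is not stated on this page, so probability averaging is an assumption here, and the logits are made-up toy values for a three-token vocabulary.

```python
import math


def softmax(logits):
    """Convert logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_probs(member_logits):
    """Average the members' predicted distributions for one token position."""
    probs = [softmax(logits) for logits in member_logits]
    k = len(probs)
    vocab = len(probs[0])
    return [sum(p[i] for p in probs) / k for i in range(vocab)]


def nll(probs, target):
    """Negative log-likelihood of the target token."""
    return -math.log(probs[target])


# Hypothetical logits from three independently trained members.
member_logits = [[2.0, 0.5, -1.0], [0.2, 1.5, 0.0], [1.0, 1.0, 0.5]]
target = 0

member_losses = [nll(softmax(logits), target) for logits in member_logits]
mean_member_loss = sum(member_losses) / len(member_losses)
ensemble_loss = nll(ensemble_probs(member_logits), target)

print(f"mean member loss: {mean_member_loss:.4f}")
print(f"ensemble loss:    {ensemble_loss:.4f}")
```

Because the negative log is convex, the loss of the averaged distribution is never higher than the average of the member losses (Jensen's inequality). How much lower it gets depends on how much the members disagree.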

Methods used · Auto-generated by claude-haiku-4-5-20251001

Weight decay regularization
Model ensembling
Knowledge distillation
Scaling laws
Power law analysis
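
For the knowledge distillation entry above, one common formulation trains the student to match the teacher's soft next-token distribution. The page does not say which distillation objective the authors use, so the token-level soft-target cross-entropy below is an assumption for illustration only; the function name and the toy distributions are hypothetical.

```python
import math


def soft_cross_entropy(teacher_probs, student_probs):
    """Cross-entropy of the student's distribution against teacher soft targets."""
    return -sum(
        t * math.log(s)
        for t, s in zip(teacher_probs, student_probs)
        if t > 0
    )


# Hypothetical teacher (e.g. an ensemble average) and two candidate students.
teacher = [0.6, 0.3, 0.1]
student_far = [0.2, 0.2, 0.6]
student_near = [0.55, 0.33, 0.12]

loss_far = soft_cross_entropy(teacher, student_far)
loss_near = soft_cross_entropy(teacher, student_near)
loss_floor = soft_cross_entropy(teacher, teacher)  # equals the teacher's entropy

print(f"far student:  {loss_far:.4f}")
print(f"near student: {loss_near:.4f}")
print(f"lower bound:  {loss_floor:.4f}")
```

The loss is minimized exactly when the student reproduces the teacher's distribution, which is why a smaller student trained this way can recover part of an ensemble's advantage.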

Limitations (author-stated) · Auto-generated by claude-haiku-4-5-20251001

Authors did not state explicit limitations.

Future work (author-stated) · Auto-generated by claude-haiku-4-5-20251001

Authors did not state explicit future directions.

Author keywords

scaling laws
data efficiency
pre-training

Related orals

Benchmarking Empirical Privacy Protection for Adaptations of Large Language Models

Half-order Fine-Tuning for Diffusion Model: A Recursive Likelihood Ratio Optimizer

Invisible Safety Threat: Malicious Finetuning for LLM via Steganography

Reducing Belief Deviation in Reinforcement Learning for Active Reasoning of LLM Agents

RefineStat: Efficient Exploration for Probabilistic Program Synthesis