The Coverage Principle: How Pre-Training Enables Post-Training
Fan Chen, Audrey Huang, Noah Golowich, Sadhika Malladi, Adam Block, Jordan T. Ash, Akshay Krishnamurthy, Dylan J Foster
We introduce the coverage profile, which captures the relationship between pre- and post-training performance and admits a rich statistical theory
Abstract
Language models demonstrate remarkable abilities when pre-trained on large text corpora and fine-tuned for specific tasks, but how and why pre-training shapes the success of the final model remains poorly understood. Notably, although pre-training success is often quantified by cross-entropy loss, cross entropy can be poorly predictive of downstream performance. Instead, we provide a theoretical perspective on this relationship through the lens of coverage, which quantifies the probability mass the pre-trained model places on high-quality responses and which is necessary and sufficient for post-training and test-time scaling methods like Best-of-N to succeed. Our main results develop an understanding of the coverage principle, a phenomenon whereby next-token prediction implicitly optimizes toward a model with good coverage. In particular, we uncover a mechanism that explains the power of coverage in predicting downstream performance: coverage generalizes faster than cross entropy, avoiding spurious dependence on problem-dependent parameters such as the sequence length. We also study practical algorithmic interventions with provable benefits for improving coverage, including (i) model/checkpoint selection procedures, (ii) gradient normalization schemes, and (iii) test-time decoding strategies.
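To build intuition for why coverage governs Best-of-N, the following is a minimal sketch (not the paper's algorithm or notation): if the model places probability mass `coverage` on the set of high-quality responses, then at least one of N i.i.d. samples lands in that set with probability 1 - (1 - coverage)^N. The function names and the Monte Carlo check are illustrative assumptions, not from the paper.

```python
import random

def best_of_n_success_prob(coverage: float, n: int) -> float:
    """Closed-form probability that at least one of n i.i.d. samples
    from the model falls in the high-quality set, given the model
    places `coverage` probability mass on that set."""
    return 1.0 - (1.0 - coverage) ** n

def simulate_best_of_n(coverage: float, n: int,
                       trials: int = 20_000, seed: int = 0) -> float:
    """Monte Carlo check: each sample is 'high quality' with
    probability `coverage`; Best-of-N succeeds on a trial if any
    of its n samples is high quality."""
    rng = random.Random(seed)
    hits = sum(
        any(rng.random() < coverage for _ in range(n))
        for _ in range(trials)
    )
    return hits / trials
```

Even a small coverage value (say 5%) yields a high Best-of-N success rate once N is moderately large, which is the sense in which coverage, rather than raw cross-entropy, is the quantity that test-time scaling amplifies.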
Develops theory linking pre-training coverage to post-training success through model scaling and practical algorithms.
- Develops understanding of coverage principle where next-token prediction implicitly optimizes toward good coverage
- Proves coverage is necessary and sufficient for post-training methods like Best-of-N to succeed
- Shows coverage generalizes faster than cross-entropy, avoiding spurious dependence on sequence length
- Provides algorithmic interventions with provable benefits for model selection, gradient normalization, and decoding
- Pre-training
- Coverage analysis
- Model scaling
The authors did not state explicit limitations.
Future directions (from the paper)
- Relax simplifying assumptions in the problem formulation
- Investigate coverage under misspecification settings
- Handle distribution shift between pre-training and post-training
- Clarify the minimal conditions required for RL methods beyond coverage
- Extend to chain-of-thought reasoning with separated reasoning and answer components
Author keywords
- language models
- reinforcement learning
- test-time scaling
- statistical learning theory
Related orals
Benchmarking Empirical Privacy Protection for Adaptations of Large Language Models
Benchmarks practical privacy risks in differential privacy-adapted LLMs, revealing distribution shifts and model choice impact effectiveness.
Half-order Fine-Tuning for Diffusion Model: A Recursive Likelihood Ratio Optimizer
Proposes Recursive Likelihood Ratio optimizer for efficient fine-tuning of diffusion models with lower variance gradient estimation.
Invisible Safety Threat: Malicious Finetuning for LLM via Steganography
Demonstrates LLMs can be finetuned to generate harmful steganographically-hidden outputs while appearing benign to safety systems.
Reducing Belief Deviation in Reinforcement Learning for Active Reasoning of LLM Agents
Proposes T3 algorithm to detect belief deviation in LLM agents and truncate trajectories for improved reinforcement learning in active reasoning tasks.
RefineStat: Efficient Exploration for Probabilistic Program Synthesis
RefineStat enforces semantic constraints and applies diagnostic-aware refinement for synthesizing valid probabilistic programs from smaller language models.