The Coverage Principle: How Pre-Training Enables Post-Training
Fan Chen, Audrey Huang, Noah Golowich, Sadhika Malladi, Adam Block, Jordan T. Ash, Akshay Krishnamurthy, Dylan J Foster
We introduce the coverage profile, which captures the relationship between pre- and post-training performance and admits a rich statistical theory
Abstract
Language models demonstrate remarkable abilities when pre-trained on large text corpora and fine-tuned for specific tasks, but how and why pre-training shapes the success of the final model remains poorly understood. Notably, although pre-training success is often quantified by cross-entropy loss, cross entropy can be poorly predictive of downstream performance. Instead, we provide a theoretical perspective on this relationship through the lens of coverage, which quantifies the probability mass the pre-trained model places on high-quality responses and which is necessary and sufficient for post-training and test-time scaling methods like Best-of-N to succeed. Our main results develop an understanding of the coverage principle, a phenomenon whereby next-token prediction implicitly optimizes toward a model with good coverage. In particular, we uncover a mechanism that explains the power of coverage in predicting downstream performance: coverage generalizes faster than cross entropy, avoiding spurious dependence on problem-dependent parameters such as the sequence length. We also study practical algorithmic interventions with provable benefits for improving coverage, including (i) model/checkpoint selection procedures, (ii) gradient normalization schemes, and (iii) test-time decoding strategies.
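To build intuition for why coverage governs Best-of-N, the following is a minimal sketch (not the paper's algorithm or notation): if the model places probability mass `coverage` on the set of high-quality responses, then at least one of N i.i.d. samples lands in that set with probability 1 - (1 - coverage)^N. The function names and the Monte Carlo check are illustrative assumptions, not from the paper.

```python
import random

def best_of_n_success_prob(coverage: float, n: int) -> float:
    """Closed-form probability that at least one of n i.i.d. samples
    from the model falls in the high-quality set, given the model
    places `coverage` probability mass on that set."""
    return 1.0 - (1.0 - coverage) ** n

def simulate_best_of_n(coverage: float, n: int,
                       trials: int = 20_000, seed: int = 0) -> float:
    """Monte Carlo check: each sample is 'high quality' with
    probability `coverage`; Best-of-N succeeds on a trial if any
    of its n samples is high quality."""
    rng = random.Random(seed)
    hits = sum(
        any(rng.random() < coverage for _ in range(n))
        for _ in range(trials)
    )
    return hits / trials
```

Even a small coverage value (say 5%) yields a high Best-of-N success rate once N is moderately large, which is the sense in which coverage, rather than raw cross-entropy, is the quantity that test-time scaling amplifies.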
Develops theory linking pre-training coverage to post-training success through model scaling and practical algorithms.
- Develops understanding of coverage principle where next-token prediction implicitly optimizes toward good coverage
- Proves coverage is necessary and sufficient for post-training methods like Best-of-N to succeed
- Shows coverage generalizes faster than cross-entropy, avoiding spurious dependence on sequence length
- Provides algorithmic interventions with provable benefits for model selection, gradient normalization, and decoding
- Pre-training
- Coverage analysis
- Model scaling
The authors did not state explicit limitations.
Future directions (from the paper)
- Relax simplifying assumptions in the problem formulation
- Investigate coverage under misspecification settings
- Handle distribution shift between pre-training and post-training
- Clarify the minimal conditions required for RL methods beyond coverage
- Extend to chain-of-thought reasoning with separated reasoning and answer components
Author keywords
- language models
- reinforcement learning
- test-time scaling
- statistical learning theory
Related orals
Benchmarking Empirical Privacy Protection for Adaptations of Large Language Models
Benchmarks practical privacy risks in differential privacy-adapted LLMs, revealing distribution shifts and model choice impact effectiveness.
Half-order Fine-Tuning for Diffusion Model: A Recursive Likelihood Ratio Optimizer
Proposes Recursive Likelihood Ratio optimizer for efficient fine-tuning of diffusion models with lower variance gradient estimation.
Invisible Safety Threat: Malicious Finetuning for LLM via Steganography
Demonstrates LLMs can be finetuned to generate harmful steganographically-hidden outputs while appearing benign to safety systems.
Reducing Belief Deviation in Reinforcement Learning for Active Reasoning of LLM Agents
Proposes T3 algorithm to detect belief deviation in LLM agents and truncate trajectories for improved reinforcement learning in active reasoning tasks.
RefineStat: Efficient Exploration for Probabilistic Program Synthesis
RefineStat enforces semantic constraints and applies diagnostic-aware refinement for synthesizing valid probabilistic programs from smaller language models.