ICLR 2026 Orals

The Coverage Principle: How Pre-Training Enables Post-Training

Fan Chen, Audrey Huang, Noah Golowich, Sadhika Malladi, Adam Block, Jordan T. Ash, Akshay Krishnamurthy, Dylan J Foster

LLMs & Reasoning · Thu, Apr 23 · 4:15 PM–4:25 PM · 201 A/B · Avg rating: 7.33 (range 6–8)
Author-provided TL;DR

We introduce the coverage profile, which captures the relationship between pre- and post-training performance and admits a rich statistical theory

Abstract

Language models demonstrate remarkable abilities when pre-trained on large text corpora and fine-tuned for specific tasks, but how and why pre-training shapes the success of the final model remains poorly understood. Notably, although pre-training success is often quantified by cross entropy loss, cross entropy can be poorly predictive of downstream performance. Instead, we provide a theoretical perspective on this relationship through the lens of coverage, which quantifies the probability mass the pre-trained model places on high-quality responses and which is necessary and sufficient for post-training and test-time scaling methods like Best-of-N to succeed. Our main results develop an understanding of the coverage principle, a phenomenon whereby next-token prediction implicitly optimizes toward a model with good coverage. In particular, we uncover a mechanism that explains the power of coverage in predicting downstream performance: coverage generalizes faster than cross entropy, avoiding spurious dependence on problem-dependent parameters such as the sequence length. We also study practical algorithmic interventions with provable benefits for improving coverage, including (i) model/checkpoint selection procedures, (ii) gradient normalization schemes, and (iii) test-time decoding strategies.
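The abstract's claim that coverage governs Best-of-N success can be made concrete with a small sketch. Under the simplifying assumption that the N samples are drawn i.i.d. from the model, the chance that at least one sample is high-quality depends only on the coverage mass; the function name and this i.i.d. simplification are ours, not the paper's.

```python
def best_of_n_success(coverage: float, n: int) -> float:
    """Probability that at least one of n i.i.d. samples from the model
    is high-quality, if the model places probability mass `coverage`
    on the set of high-quality responses."""
    return 1.0 - (1.0 - coverage) ** n

# Even small coverage is amplified quickly by test-time sampling:
for p in (0.001, 0.01, 0.1):
    print(p, best_of_n_success(p, 64))
```

Conversely, if coverage is zero, no amount of test-time sampling helps, which is the sense in which coverage is necessary as well as sufficient.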

One-sentence summary·Auto-generated by claude-haiku-4-5-20251001

Develops theory linking pre-training coverage to post-training success through model scaling and practical algorithms.

Contributions·Auto-generated by claude-haiku-4-5-20251001
  • Develops understanding of coverage principle where next-token prediction implicitly optimizes toward good coverage
  • Proves coverage is necessary and sufficient for post-training methods like Best-of-N to succeed
  • Shows coverage generalizes faster than cross-entropy, avoiding spurious dependence on sequence length
  • Provides algorithmic interventions with theoretical benefits for model selection, gradient normalization, and decoding
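A toy illustration of the third bullet, and of why coverage-based checkpoint selection can differ from cross-entropy-based selection: two checkpoints with identical average log-loss on held-out good responses can have very different Best-of-N coverage. The coverage proxy below and the numbers are our own hypothetical construction, not the paper's definitions.

```python
import math

def coverage_score(logprobs, n=64):
    """Hypothetical Best-of-N-style coverage proxy: average over prompts of
    the chance that n i.i.d. samples hit the known good response, given its
    log-probability under the checkpoint."""
    return sum(1.0 - (1.0 - math.exp(lp)) ** n for lp in logprobs) / len(logprobs)

def cross_entropy(logprobs):
    """Average negative log-likelihood of the same held-out responses."""
    return -sum(logprobs) / len(logprobs)

# Two toy checkpoints with identical cross entropy on two prompts:
ckpt_a = [-1.0, -9.0]   # mass spread unevenly across prompts
ckpt_b = [-5.0, -5.0]   # mass spread evenly

assert cross_entropy(ckpt_a) == cross_entropy(ckpt_b)   # CE cannot tell them apart
assert coverage_score(ckpt_a) > coverage_score(ckpt_b)  # coverage can
```

The uneven checkpoint wins under the coverage proxy because Best-of-N only needs decent mass on a good response per prompt, whereas cross entropy averages log-probabilities and so is dominated by the worst prompts.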
Methods used·Auto-generated by claude-haiku-4-5-20251001
  • Pre-training
  • Coverage analysis
  • Model scaling
Limitations (author-stated)·Auto-generated by claude-haiku-4-5-20251001

Authors did not state explicit limitations.

Future work (author-stated)·Auto-generated by claude-haiku-4-5-20251001
  • Relax simplifying assumptions in the problem formulation
  • Investigate coverage in misspecified settings
  • Handle distribution shift between pre-training and post-training
  • Clarify the minimal conditions, beyond coverage, required for RL methods
  • Extend the analysis to chain-of-thought reasoning with separated reasoning and answer components

Author keywords

  • language models
  • reinforcement learning
  • test-time scaling
  • statistical learning theory
