ICLR 2026 Orals

Train-before-Test Harmonizes Language Model Rankings

Guanhua Zhang, Ricardo Dominguez-Olmedo, Moritz Hardt

LLMs & Reasoning · Sat, Apr 25 · 3:15 PM–3:25 PM · 203 A/B · Avg rating: 7.00 (6–8)

Abstract

Existing language model benchmarks provide contradictory model rankings, even for benchmarks that aim to capture similar skills. This dilemma of conflicting rankings hampers model selection, clouds model comparisons, and adds confusion to a growing ecosystem of competing models. In this paper, we take a different perspective on model comparison: instead of relying on out-of-the-box performance via direct evaluation, we compare model potential by providing each model with identical benchmark-specific fine-tuning before evaluation. We call this approach train-before-test. Our primary contribution is a comprehensive empirical evaluation of model potential across 24 benchmarks and 61 models. First, we demonstrate that model potential rankings obtained through train-before-test exhibit remarkable consistency across all benchmarks. Whereas traditional rankings demonstrate little external validity under direct evaluation, they enjoy a significant degree of external validity when applying train-before-test: model potential rankings transfer gracefully from one benchmark to another. Second, train-before-test restores the connection between perplexity and downstream task performance, lost under direct evaluation. Remarkably, even pre-finetuning perplexity of a base model predicts post-finetuning downstream performance, suggesting that ranking consistency reflects inherent model potential rather than fine-tuning artifacts. Finally, train-before-test reduces the model-score matrix to essentially rank one, indicating that model potential is dominated by one latent factor, uncovered by train-before-test. While direct evaluation remains useful for assessing deployment-ready performance, train-before-test provides a complementary lens for understanding achievable performance of models after adaptation.
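
The sketch below is an illustrative companion to the abstract's two headline findings, not the paper's analysis or data: starting from a hypothetical models × benchmarks matrix of post-fine-tuning scores, it computes average pairwise Spearman rank correlations between benchmarks (cross-benchmark ranking consistency) and the fraction of spectral energy in the leading singular value (the "essentially rank one" check).

```python
# Illustrative sketch only: the score matrix below is synthetic, not the paper's data.
# It shows how ranking consistency and near-rank-one structure could be measured.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

# Hypothetical post-fine-tuning scores: rows = models, columns = benchmarks.
# A single latent "model potential" factor plus small benchmark-specific noise.
n_models, n_benchmarks = 61, 24
potential = rng.uniform(0.3, 0.9, size=(n_models, 1))
scores = potential + 0.02 * rng.normal(size=(n_models, n_benchmarks))

# Cross-benchmark ranking consistency: mean pairwise Spearman rank correlation.
corrs = []
for i in range(n_benchmarks):
    for j in range(i + 1, n_benchmarks):
        rho, _ = spearmanr(scores[:, i], scores[:, j])
        corrs.append(rho)
print(f"mean pairwise Spearman rank correlation: {np.mean(corrs):.3f}")

# Rank-one check: share of squared spectral energy in the top singular value.
singular_values = np.linalg.svd(scores, compute_uv=False)
energy = singular_values[0] ** 2 / np.sum(singular_values ** 2)
print(f"fraction of spectral energy in leading singular value: {energy:.3f}")
```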

One-sentence summary · Auto-generated by claude-haiku-4-5-20251001

Proposes a train-before-test approach, showing that model potential rankings transfer across benchmarks better than rankings from direct evaluation.

Contributions · Auto-generated by claude-haiku-4-5-20251001
  • Comprehensive empirical evaluation of model potential across 24 benchmarks and 61 models, each given identical benchmark-specific fine-tuning
  • Demonstrates model potential rankings exhibit remarkable consistency and external validity across benchmarks
  • Train-before-test restores connection between perplexity and downstream performance lost in direct evaluation
Methods used · Auto-generated by claude-haiku-4-5-20251001
  • Fine-tuning
  • Model evaluation
  • Benchmark design
  • Parameter-efficient fine-tuning (see the sketch after this list)
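
As an illustration of the listed methods, the sketch below shows what benchmark-specific parameter-efficient fine-tuning before evaluation could look like with the Hugging Face transformers and peft libraries. The model name, stand-in dataset, and LoRA hyperparameters are assumptions for illustration, not the authors' configuration.

```python
# Hypothetical sketch of the "train" step in train-before-test: give a model
# the same benchmark-specific LoRA fine-tuning before evaluating it.
# Model, dataset, and hyperparameters below are illustrative assumptions.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # any causal LM under comparison
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Identical parameter-efficient adaptation for every model being ranked.
lora_cfg = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                      target_modules=["q_proj", "v_proj"],
                      task_type="CAUSAL_LM")
model = get_peft_model(model, lora_cfg)

# Benchmark-specific training split (a generic text corpus as a stand-in here).
train_data = load_dataset("wikitext", "wikitext-2-raw-v1", split="train[:1%]")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

train_data = train_data.map(tokenize, batched=True, remove_columns=["text"])
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="tbt-lora", num_train_epochs=1,
                           per_device_train_batch_size=4, learning_rate=2e-4),
    train_dataset=train_data,
    data_collator=collator,
)
trainer.train()
# After fine-tuning, evaluate on the benchmark's test split and rank models
# by this post-adaptation score rather than their out-of-the-box score.
```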
Limitations (author-stated) · Auto-generated by claude-haiku-4-5-20251001
  • Train-before-test increases evaluation cost through required fine-tuning on task-specific data
  • Cross-benchmark ranking consistency remains imperfect; the residual disagreement may stem from incomplete PEFT adaptation or measurement noise
  • Many benchmarks no longer provide training data, making train-before-test more difficult to apply
  • Some commercial model providers don't easily allow fine-tuning
Future work (author-stated) · Auto-generated by claude-haiku-4-5-20251001

Authors did not state explicit future directions.

Author keywords

  • Evaluation
  • Large language model
