Train-before-Test Harmonizes Language Model Rankings
Guanhua Zhang, Ricardo Dominguez-Olmedo, Moritz Hardt
Abstract
Existing language model benchmarks provide contradictory model rankings, even for benchmarks that aim to capture similar skills. This dilemma of conflicting rankings hampers model selection, clouds model comparisons, and adds confusion to a growing ecosystem of competing models. In this paper, we take a different perspective on model comparison: instead of relying on out-of-the-box performance via direct evaluation, we compare model potential by providing each model with identical benchmark-specific fine-tuning before evaluation. We call this approach train-before-test. Our primary contribution is a comprehensive empirical evaluation of model potential across 24 benchmarks and 61 models. First, we demonstrate that model potential rankings obtained through train-before-test exhibit remarkable consistency across all benchmarks. Whereas traditional rankings demonstrate little external validity under direct evaluation, they enjoy a significant degree of external validity when applying train-before-test: model potential rankings transfer gracefully from one benchmark to another. Second, train-before-test restores the connection between perplexity and downstream task performance, lost under direct evaluation. Remarkably, even pre-finetuning perplexity of a base model predicts post-finetuning downstream performance, suggesting that ranking consistency reflects inherent model potential rather than fine-tuning artifacts. Finally, train-before-test reduces the model-score matrix to essentially rank one, indicating that model potential is dominated by one latent factor, uncovered by train-before-test. While direct evaluation remains useful for assessing deployment-ready performance, train-before-test provides a complementary lens for understanding achievable performance of models after adaptation.
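The abstract's claim that train-before-test reduces the model-score matrix to "essentially rank one" can be made concrete with a small numerical check. The sketch below (synthetic data, not the authors' code or measurements) measures the fraction of a score matrix's squared spectral energy captured by its top singular value; a matrix generated from a single latent model-potential factor scores near 1, while an unstructured matrix does not.

```python
# Illustrative sketch with synthetic data: quantify how close a
# model-by-benchmark score matrix is to rank one, in the sense the
# abstract describes for train-before-test scores.
import numpy as np

def rank_one_energy(scores: np.ndarray) -> float:
    """Fraction of squared spectral energy in the top singular value."""
    s = np.linalg.svd(scores, compute_uv=False)
    return float(s[0] ** 2 / np.sum(s ** 2))

rng = np.random.default_rng(0)
# Hypothetical generative model: one latent ability per model,
# one scale per benchmark (61 models x 24 benchmarks, as in the paper).
ability = rng.uniform(0.3, 0.9, size=(61, 1))
difficulty = rng.uniform(0.5, 1.5, size=(1, 24))
near_rank_one = ability @ difficulty + 0.01 * rng.normal(size=(61, 24))
unstructured = rng.normal(size=(61, 24))  # no shared factor

print(rank_one_energy(near_rank_one))  # close to 1.0
print(rank_one_energy(unstructured))   # well below 1.0
```

A value near 1 means one latent factor explains almost all score variation, which is what makes a single consistent ranking possible.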
Proposes a train-before-test approach, showing that model potential rankings transfer across benchmarks better than rankings from direct evaluation.
- Comprehensive empirical evaluation of model potential across 24 benchmarks and 61 models with fine-tuning
- Demonstrates model potential rankings exhibit remarkable consistency and external validity across benchmarks
- Train-before-test restores connection between perplexity and downstream performance lost in direct evaluation
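The ranking-consistency finding above can be illustrated with a toy experiment (synthetic data, not the paper's results): average the pairwise Spearman rank correlation between benchmark columns of a model-score matrix. Scores driven by a shared model-potential factor, as train-before-test is reported to uncover, yield near-perfect consistency; benchmark-idiosyncratic scores do not.

```python
# Hypothetical illustration of cross-benchmark ranking consistency.
# "direct" scores are idiosyncratic per benchmark; "after_ft" scores
# share one latent model-potential factor plus small noise.
import numpy as np

def spearman(x: np.ndarray, y: np.ndarray) -> float:
    """Spearman correlation via Pearson correlation of ranks."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    return float(np.corrcoef(rx, ry)[0, 1])

def mean_pairwise_consistency(scores: np.ndarray) -> float:
    """Average Spearman correlation over all pairs of benchmark columns."""
    n = scores.shape[1]
    corrs = [spearman(scores[:, i], scores[:, j])
             for i in range(n) for j in range(i + 1, n)]
    return float(np.mean(corrs))

rng = np.random.default_rng(1)
potential = rng.uniform(size=61)              # latent potential per model
direct = rng.uniform(size=(61, 24))           # no shared structure
after_ft = potential[:, None] + 0.05 * rng.normal(size=(61, 24))

print(mean_pairwise_consistency(direct))    # near 0
print(mean_pairwise_consistency(after_ft))  # near 1
```

High average pairwise correlation is exactly the "external validity" property the paper attributes to train-before-test: a ranking measured on one benchmark transfers to another.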
Topics
- Fine-tuning
- Model evaluation
- Benchmark design
- Parameter-efficient fine-tuning
Limitations
- Train-before-test increases evaluation cost through the required fine-tuning on task-specific data
- Cross-benchmark ranking consistency remains imperfect; residual correlation may arise from incomplete PEFT adaptation or measurement noise (from the paper)
- Many benchmarks no longer provide training data, making train-before-test more difficult to apply (from the paper)
- Some commercial model providers do not easily allow fine-tuning (from the paper)
Future directions
Authors did not state explicit future directions.
Author keywords
- Evaluation
- Large language model
Related orals
Benchmarking Empirical Privacy Protection for Adaptations of Large Language Models
Benchmarks practical privacy risks in differential-privacy-adapted LLMs, revealing that distribution shifts and model choice impact effectiveness.
Half-order Fine-Tuning for Diffusion Model: A Recursive Likelihood Ratio Optimizer
Proposes Recursive Likelihood Ratio optimizer for efficient fine-tuning of diffusion models with lower variance gradient estimation.
Invisible Safety Threat: Malicious Finetuning for LLM via Steganography
Demonstrates LLMs can be finetuned to generate harmful steganographically-hidden outputs while appearing benign to safety systems.
Reducing Belief Deviation in Reinforcement Learning for Active Reasoning of LLM Agents
Proposes T3 algorithm to detect belief deviation in LLM agents and truncate trajectories for improved reinforcement learning in active reasoning tasks.
RefineStat: Efficient Exploration for Probabilistic Program Synthesis
RefineStat enforces semantic constraints and applies diagnostic-aware refinement for synthesizing valid probabilistic programs from smaller language models.