Train-before-Test Harmonizes Language Model Rankings
Guanhua Zhang, Ricardo Dominguez-Olmedo, Moritz Hardt
Abstract
Existing language model benchmarks provide contradictory model rankings, even for benchmarks that aim to capture similar skills. This dilemma of conflicting rankings hampers model selection, clouds model comparisons, and adds confusion to a growing ecosystem of competing models. In this paper, we take a different perspective on model comparison: instead of relying on out-of-the-box performance via direct evaluation, we compare model potential by providing each model with identical benchmark-specific fine-tuning before evaluation. We call this approach train-before-test. Our primary contribution is a comprehensive empirical evaluation of model potential across 24 benchmarks and 61 models. First, we demonstrate that model potential rankings obtained through train-before-test exhibit remarkable consistency across all benchmarks. Whereas traditional rankings demonstrate little external validity under direct evaluation, they enjoy a significant degree of external validity when applying train-before-test: model potential rankings transfer gracefully from one benchmark to another. Second, train-before-test restores the connection between perplexity and downstream task performance, lost under direct evaluation. Remarkably, even pre-finetuning perplexity of a base model predicts post-finetuning downstream performance, suggesting that ranking consistency reflects inherent model potential rather than fine-tuning artifacts. Finally, train-before-test reduces the model-score matrix to essentially rank one, indicating that model potential is dominated by one latent factor, uncovered by train-before-test. While direct evaluation remains useful for assessing deployment-ready performance, train-before-test provides a complementary lens for understanding achievable performance of models after adaptation.
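The abstract's claim that train-before-test reduces the model-score matrix to "essentially rank one" can be made concrete with a small numerical check. The sketch below (synthetic data, not the authors' code or measurements) measures the fraction of a score matrix's squared spectral energy captured by its top singular value; a matrix generated from a single latent model-potential factor scores near 1, while an unstructured matrix does not.

```python
# Illustrative sketch with synthetic data: quantify how close a
# model-by-benchmark score matrix is to rank one, in the sense the
# abstract describes for train-before-test scores.
import numpy as np

def rank_one_energy(scores: np.ndarray) -> float:
    """Fraction of squared spectral energy in the top singular value."""
    s = np.linalg.svd(scores, compute_uv=False)
    return float(s[0] ** 2 / np.sum(s ** 2))

rng = np.random.default_rng(0)
# Hypothetical generative model: one latent ability per model,
# one scale per benchmark (61 models x 24 benchmarks, as in the paper).
ability = rng.uniform(0.3, 0.9, size=(61, 1))
difficulty = rng.uniform(0.5, 1.5, size=(1, 24))
near_rank_one = ability @ difficulty + 0.01 * rng.normal(size=(61, 24))
unstructured = rng.normal(size=(61, 24))  # no shared factor

print(rank_one_energy(near_rank_one))  # close to 1.0
print(rank_one_energy(unstructured))   # well below 1.0
```

A value near 1 means one latent factor explains almost all score variation, which is what makes a single consistent ranking possible.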
Proposes a train-before-test approach, showing that model potential rankings transfer across benchmarks better than rankings from direct evaluation.
- Comprehensive empirical evaluation of model potential across 24 benchmarks and 61 models with fine-tuning
- Demonstrates model potential rankings exhibit remarkable consistency and external validity across benchmarks
- Train-before-test restores connection between perplexity and downstream performance lost in direct evaluation
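The ranking-consistency finding above can be illustrated with a toy experiment (synthetic data, not the paper's results): average the pairwise Spearman rank correlation between benchmark columns of a model-score matrix. Scores driven by a shared model-potential factor, as train-before-test is reported to uncover, yield near-perfect consistency; benchmark-idiosyncratic scores do not.

```python
# Hypothetical illustration of cross-benchmark ranking consistency.
# "direct" scores are idiosyncratic per benchmark; "after_ft" scores
# share one latent model-potential factor plus small noise.
import numpy as np

def spearman(x: np.ndarray, y: np.ndarray) -> float:
    """Spearman correlation via Pearson correlation of ranks."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    return float(np.corrcoef(rx, ry)[0, 1])

def mean_pairwise_consistency(scores: np.ndarray) -> float:
    """Average Spearman correlation over all pairs of benchmark columns."""
    n = scores.shape[1]
    corrs = [spearman(scores[:, i], scores[:, j])
             for i in range(n) for j in range(i + 1, n)]
    return float(np.mean(corrs))

rng = np.random.default_rng(1)
potential = rng.uniform(size=61)              # latent potential per model
direct = rng.uniform(size=(61, 24))           # no shared structure
after_ft = potential[:, None] + 0.05 * rng.normal(size=(61, 24))

print(mean_pairwise_consistency(direct))    # near 0
print(mean_pairwise_consistency(after_ft))  # near 1
```

High average pairwise correlation is exactly the "external validity" property the paper attributes to train-before-test: a ranking measured on one benchmark transfers to another.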
Topics
- Fine-tuning
- Model evaluation
- Benchmark design
- Parameter-efficient fine-tuning
Limitations
- Train-before-test increases evaluation cost through the required fine-tuning on task-specific data
- Cross-benchmark ranking consistency remains imperfect; residual correlation may arise from incomplete PEFT adaptation or measurement noise (from the paper)
- Many benchmarks no longer provide training data, making train-before-test more difficult to apply (from the paper)
- Some commercial model providers do not easily allow fine-tuning (from the paper)
Future directions
Authors did not state explicit future directions.
Author keywords
- Evaluation
- Large language model
Related orals
Benchmarking Empirical Privacy Protection for Adaptations of Large Language Models
Benchmarks practical privacy risks in differential-privacy-adapted LLMs, revealing that distribution shifts and model choice impact effectiveness.
Half-order Fine-Tuning for Diffusion Model: A Recursive Likelihood Ratio Optimizer
Proposes Recursive Likelihood Ratio optimizer for efficient fine-tuning of diffusion models with lower variance gradient estimation.
Invisible Safety Threat: Malicious Finetuning for LLM via Steganography
Demonstrates LLMs can be finetuned to generate harmful steganographically-hidden outputs while appearing benign to safety systems.
Reducing Belief Deviation in Reinforcement Learning for Active Reasoning of LLM Agents
Proposes T3 algorithm to detect belief deviation in LLM agents and truncate trajectories for improved reinforcement learning in active reasoning tasks.
RefineStat: Efficient Exploration for Probabilistic Program Synthesis
RefineStat enforces semantic constraints and applies diagnostic-aware refinement for synthesizing valid probabilistic programs from smaller language models.