Reasoning with Sampling: Your Base Model is Smarter Than You Think
Aayush Karan, Yilun Du
We find a training-free sampling algorithm that achieves reasoning boosts on base models comparable to those obtained by RL techniques.
Abstract
Frontier reasoning models have exhibited incredible capabilities across a wide array of disciplines, driven by posttraining large language models (LLMs) with reinforcement learning (RL). However, despite the widespread success of this paradigm, much of the literature has been devoted to disentangling whether truly novel behaviors emerge during RL or are already present in the base models. In our work, we approach this question from a different angle, instead asking whether comparable reasoning capabilities can be elicited from base models at inference time by pure sampling, without any additional training. Inspired by Markov chain Monte Carlo (MCMC) techniques for sampling from sharpened distributions, we propose a simple iterative sampling algorithm leveraging the base models' own likelihoods. Across different base models, we show that our algorithm offers substantial boosts in reasoning that nearly match, and even outperform, those from RL on a wide variety of single-shot tasks, including MATH500, HumanEval, and GPQA. Moreover, our sampler avoids the collapse in diversity over multiple samples that is characteristic of RL-posttraining. Crucially, our method does not require training, curated datasets, or a verifier, suggesting broad applicability beyond easily verifiable domains.
Power sampling algorithm elicits strong reasoning from base models at inference time via MCMC without additional training.
- Algorithm samples from base models without training using power distribution target
- Motivated by Markov chain Monte Carlo techniques applied to autoregressive generation
- Achieves single-shot reasoning performance on par with state-of-the-art RL-posttraining
- Avoids diversity collapse characteristic of RL-posttraining while maintaining verifier-free applicability
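The key points above describe sampling from a sharpened "power distribution" of a base model's own likelihoods via MCMC, with no training. The paper's actual algorithm is not reproduced here; as a hedged illustration of the general idea, the sketch below uses a toy stand-in model and an independence Metropolis-Hastings sampler whose target is $p(x)^\alpha$ and whose proposal is the base model $p$ itself (in which case the acceptance ratio simplifies to $(p(x')/p(x))^{\alpha-1}$). All names, the toy model, and the choice of $\alpha$ are illustrative assumptions.

```python
import math
import random

# Toy "base model": each token in {0, 1} is drawn i.i.d. with these
# probabilities. This is a hypothetical stand-in for an LLM's
# autoregressive distribution, used only to illustrate the mechanics.
P_NEXT = [0.7, 0.3]

def sample_seq(rng, length=5):
    """Draw one sequence from the toy base model."""
    return tuple(0 if rng.random() < P_NEXT[0] else 1 for _ in range(length))

def log_p(seq):
    """Base-model log-likelihood of a full sequence."""
    return sum(math.log(P_NEXT[t]) for t in seq)

def power_sample(rng, alpha, steps=50, length=5):
    """Independence Metropolis-Hastings targeting p(x)^alpha.

    With the base model p as the proposal, the MH acceptance ratio
    pi(x') q(x) / (pi(x) q(x')) reduces to (p(x')/p(x))^(alpha - 1),
    so only the model's own log-likelihoods are needed.
    """
    x = sample_seq(rng, length)
    for _ in range(steps):
        x_new = sample_seq(rng, length)
        log_accept = (alpha - 1.0) * (log_p(x_new) - log_p(x))
        if math.log(rng.random()) < log_accept:
            x = x_new
    return x

rng = random.Random(0)
n = 2000
mode = (0,) * 5  # the highest-likelihood sequence under the toy model

# Plain base-model sampling vs. power sampling with alpha = 4:
base_hits = sum(sample_seq(rng) == mode for _ in range(n)) / n
power_hits = sum(power_sample(rng, alpha=4.0) == mode for _ in range(n)) / n
print(base_hits, power_hits)
```

Under the toy model the mode has base probability 0.7^5 ≈ 0.17, while the normalized power distribution with alpha = 4 places roughly 0.85 on it, so the sharpened sampler concentrates on high-likelihood sequences while still being stochastic, consistent with the diversity point above.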
- MCMC sampling
- Power distribution sampling
- Iterative sampling
- MATH500
- HumanEval
- GPQA
Authors did not state explicit limitations.
Authors did not state explicit future directions.
Author keywords
- LLMs
- reasoning
- MCMC
- sampling
- inference-time compute
Related orals
Benchmarking Empirical Privacy Protection for Adaptations of Large Language Models
Benchmarks practical privacy risks in differential privacy-adapted LLMs, revealing that distribution shifts and model choice impact effectiveness.
Half-order Fine-Tuning for Diffusion Model: A Recursive Likelihood Ratio Optimizer
Proposes Recursive Likelihood Ratio optimizer for efficient fine-tuning of diffusion models with lower variance gradient estimation.
Invisible Safety Threat: Malicious Finetuning for LLM via Steganography
Demonstrates LLMs can be finetuned to generate harmful steganographically-hidden outputs while appearing benign to safety systems.
Reducing Belief Deviation in Reinforcement Learning for Active Reasoning of LLM Agents
Proposes T3 algorithm to detect belief deviation in LLM agents and truncate trajectories for improved reinforcement learning in active reasoning tasks.
RefineStat: Efficient Exploration for Probabilistic Program Synthesis
RefineStat enforces semantic constraints and applies diagnostic-aware refinement for synthesizing valid probabilistic programs from smaller language models.