Global Resolution: Optimal Multi-Draft Speculative Sampling via Convex Optimization
Rahul Krishna Thomas, Arka Pal
We reduce optimal multi-draft speculative sampling to a convex minimization problem, and solve a truncated version to achieve state-of-the-art acceptance with negligible performance degradation.
Abstract
Speculative sampling reduces the latency of autoregressive decoding for target model LLMs without sacrificing inference quality, by using a cheap draft model to suggest a candidate token and a verification criterion to accept or resample this token. To improve acceptance and decoding efficiency, recent work has explored the multi-draft extension, where at each step $n$ draft tokens are generated, and the verification criterion is a distribution conditioned on these. When this criterion maximizes the probability of accepting some draft token, it is called the optimal transport (OT). However, finding the OT is difficult, as it is the solution of a linear program (OTLP) in over $V^n$ variables, with $V$ being the vocabulary size. Two recent theoretical works have reframed the OTLP in terms of importance sampling or subset selection. In this work, we prove that these formulations are equivalent to an exponentially large relaxed OTLP, so it remains infeasible to solve. Then, we reverse engineer subset selection to formulate the OTLP as a max-flow problem. With a novel application of polymatroid theory, we reduce the exponentially large OTLP to a convex optimization problem in at most $V$ variables. This allows us to devise an algorithm for optimal $n$-draft speculative sampling when the $n$ tokens are chosen i.i.d. from a single draft model, which can be tuned to arbitrary accuracy. Finally, we measure acceptance rates and algorithm runtimes for various $n$ and top-$k$ draft sampling settings. Our findings give the first multi-draft algorithm with 90% acceptance and under 100 ms of overhead per generated token with negligible deviation from the target model distribution.
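For context on the verification criterion the abstract describes, below is a minimal sketch of the classic single-draft accept/resample rule (accept the draft token with probability $\min(1, p(x)/q(x))$, else resample from the residual), which the multi-draft OT criterion generalizes. The function name and toy distributions are illustrative, not the paper's code.

```python
import numpy as np

def verify_single_draft(p, q, x, rng):
    """Classic single-draft verification: accept draft token x
    (sampled from draft distribution q) with probability
    min(1, p[x] / q[x]); otherwise resample from the residual
    distribution proportional to max(p - q, 0). The returned
    token is distributed exactly according to the target p."""
    if rng.random() < min(1.0, p[x] / q[x]):
        return x, True  # draft token accepted
    residual = np.maximum(p - q, 0.0)
    residual /= residual.sum()
    return rng.choice(len(p), p=residual), False  # resampled

# Toy example: empirical output matches the target distribution p.
rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.2])  # target distribution
q = np.array([0.2, 0.5, 0.3])  # draft distribution
counts = np.zeros(3)
for _ in range(20000):
    x = rng.choice(3, p=q)            # draft proposes a token
    tok, _ = verify_single_draft(p, q, x, rng)
    counts[tok] += 1
```

This rule preserves the target distribution exactly; the multi-draft setting studied in the paper replaces it with a distribution conditioned on all $n$ draft tokens.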
Solves optimal multi-draft speculative sampling via convex optimization, achieving 90% acceptance rates.
- Proves the importance-sampling and subset-selection formulations equivalent to an exponentially large relaxed OTLP
- Reverse-engineers subset selection to formulate the OTLP as a max-flow problem
- Applies polymatroid theory to reduce exponentially large OTLP to convex optimization in at most V variables
- Achieves first multi-draft algorithm with 90% acceptance and under 100ms overhead per token
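The acceptance probability that the OT criterion maximizes has a well-known closed form in the single-draft case: it equals $\sum_x \min(p(x), q(x)) = 1 - \mathrm{TV}(p, q)$. A small sketch of this baseline quantity (the function name is illustrative; the multi-draft optimum the paper computes requires the convex program described above):

```python
import numpy as np

def single_draft_acceptance(p, q):
    """Maximal acceptance probability with one draft token:
    sum_x min(p(x), q(x)), i.e. 1 minus the total variation
    distance between target p and draft q."""
    return np.minimum(p, q).sum()

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.5, 0.3])
print(single_draft_acceptance(p, q))  # 0.2 + 0.3 + 0.2 = 0.7
```

With $n$ i.i.d. drafts the optimal acceptance can only improve on this value, which is what motivates the multi-draft extension.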
- Speculative sampling
- Optimal transport
- Convex optimization
- Max-flow
Authors did not state explicit limitations.
- Examine efficacy in multi-step settings (from the paper)
- Extend to non-i.i.d. draft settings (from the paper)
Author keywords
- LLMs
- Inference
- Optimal Transport
- Speculative Decoding
Related orals
Causal Structure Learning in Hawkes Processes with Complex Latent Confounder Networks
Develops causal structure learning framework for Hawkes processes identifying latent confounder subprocesses.
CauKer: Classification Time Series Foundation Models Can Be Pretrained on Synthetic Data
Generates diverse synthetic time series for pretraining foundation models with clear scaling laws.
Conformal Robustness Control: A New Strategy for Robust Decision
CRC optimizes prediction set construction under explicit robustness constraints instead of coverage for more efficient robust decisions.
Structured Flow Autoencoders: Learning Structured Probabilistic Representations with Flow Matching
Structured Flow Autoencoders integrate flow matching with graphical models for structured representation learning.