Global Resolution: Optimal Multi-Draft Speculative Sampling via Convex Optimization
Rahul Krishna Thomas, Arka Pal
We reduce optimal multi-draft speculative sampling to a convex minimization problem, and solve a truncated version to achieve state-of-the-art acceptance with negligible performance degradation.
Abstract
Speculative sampling reduces the latency of autoregressive decoding for target model LLMs without sacrificing inference quality, by using a cheap draft model to suggest a candidate token and a verification criterion to accept or resample this token. To improve acceptance and decoding efficiency, recent work has explored the multi-draft extension, where at each step $n$ draft tokens are generated, and the verification criterion is a distribution conditioned on these. When this criterion maximizes the probability of accepting some draft token, it is called the optimal transport (OT). However, finding the OT is difficult, as it is the solution of a linear program (OTLP) in over $V^n$ variables, with $V$ being the vocabulary size. Two recent theoretical works have reframed the OTLP in terms of importance sampling or subset selection. In this work, we prove that these formulations are equivalent to an exponentially large relaxed OTLP, so it remains infeasible to solve. Then, we reverse engineer subset selection to formulate the OTLP as a max-flow problem. With a novel application of polymatroid theory, we reduce the exponentially large OTLP to a convex optimization problem in at most $V$ variables. This allows us to devise an algorithm for optimal $n$-draft speculative sampling when the $n$ tokens are chosen i.i.d. from a single draft model, which can be tuned to arbitrary accuracy. Finally, we measure acceptance rates and algorithm runtimes for various $n$ and top-$k$ draft sampling settings. Our findings give the first multi-draft algorithm with 90% acceptance and under 100 ms of overhead per generated token with negligible deviation from the target model distribution.
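For context on the verification criterion the abstract describes, below is a minimal sketch of the classic single-draft accept/resample rule (accept the draft token with probability $\min(1, p(x)/q(x))$, else resample from the residual), which the multi-draft OT criterion generalizes. The function name and toy distributions are illustrative, not the paper's code.

```python
import numpy as np

def verify_single_draft(p, q, x, rng):
    """Classic single-draft verification: accept draft token x
    (sampled from draft distribution q) with probability
    min(1, p[x] / q[x]); otherwise resample from the residual
    distribution proportional to max(p - q, 0). The returned
    token is distributed exactly according to the target p."""
    if rng.random() < min(1.0, p[x] / q[x]):
        return x, True  # draft token accepted
    residual = np.maximum(p - q, 0.0)
    residual /= residual.sum()
    return rng.choice(len(p), p=residual), False  # resampled

# Toy example: empirical output matches the target distribution p.
rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.2])  # target distribution
q = np.array([0.2, 0.5, 0.3])  # draft distribution
counts = np.zeros(3)
for _ in range(20000):
    x = rng.choice(3, p=q)            # draft proposes a token
    tok, _ = verify_single_draft(p, q, x, rng)
    counts[tok] += 1
```

This rule preserves the target distribution exactly; the multi-draft setting studied in the paper replaces it with a distribution conditioned on all $n$ draft tokens.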
Solves optimal multi-draft speculative sampling via convex optimization, achieving 90% acceptance rates.
- Proves the importance-sampling and subset-selection formulations equivalent to an exponentially large relaxed OTLP
- Reverse-engineers subset selection to formulate the OTLP as a max-flow problem
- Applies polymatroid theory to reduce exponentially large OTLP to convex optimization in at most V variables
- Achieves first multi-draft algorithm with 90% acceptance and under 100ms overhead per token
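The acceptance probability that the OT criterion maximizes has a well-known closed form in the single-draft case: it equals $\sum_x \min(p(x), q(x)) = 1 - \mathrm{TV}(p, q)$. A small sketch of this baseline quantity (the function name is illustrative; the multi-draft optimum the paper computes requires the convex program described above):

```python
import numpy as np

def single_draft_acceptance(p, q):
    """Maximal acceptance probability with one draft token:
    sum_x min(p(x), q(x)), i.e. 1 minus the total variation
    distance between target p and draft q."""
    return np.minimum(p, q).sum()

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.5, 0.3])
print(single_draft_acceptance(p, q))  # 0.2 + 0.3 + 0.2 = 0.7
```

With $n$ i.i.d. drafts the optimal acceptance can only improve on this value, which is what motivates the multi-draft extension.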
- Speculative sampling
- Optimal transport
- Convex optimization
- Max-flow
Authors did not state explicit limitations.
- Examine efficacy in multi-step settings (from the paper)
- Extend to non-i.i.d. draft settings (from the paper)
Author keywords
- LLMs
- Inference
- Optimal Transport
- Speculative Decoding
Related orals
Causal Structure Learning in Hawkes Processes with Complex Latent Confounder Networks
Develops causal structure learning framework for Hawkes processes identifying latent confounder subprocesses.
CauKer: Classification Time Series Foundation Models Can Be Pretrained on Synthetic Data
Generates diverse synthetic time series for pretraining foundation models with clear scaling laws.
Conformal Robustness Control: A New Strategy for Robust Decision
CRC optimizes prediction set construction under explicit robustness constraints instead of coverage for more efficient robust decisions.
Structured Flow Autoencoders: Learning Structured Probabilistic Representations with Flow Matching
Structured Flow Autoencoders integrate flow matching with graphical models for structured representation learning.