ICLR 2026 Orals

Global Resolution: Optimal Multi-Draft Speculative Sampling via Convex Optimization

Rahul Krishna Thomas, Arka Pal

Causal & Statistical Methods · Sat, Apr 25 · 11:06 AM–11:16 AM · Amphitheater · Avg rating: 6.50 (6–8)
Author-provided TL;DR

We reduce optimal multi-draft speculative sampling to a convex minimization problem, and solve a truncated version to achieve state-of-the-art acceptance with negligible performance degradation.

Abstract

Speculative sampling reduces the latency of autoregressive decoding for target LLMs without sacrificing inference quality, by using a cheap draft model to suggest a candidate token and a verification criterion to accept or resample this token. To improve acceptance and decoding efficiency, recent work has explored the multi-draft extension, where at each step $n$ draft tokens are generated and the verification criterion is a distribution conditioned on them. When this criterion maximizes the probability of accepting some draft token, it is called the optimal transport (OT). However, finding the OT is difficult, as it is the solution of a linear program (OTLP) in over $V^n$ variables, where $V$ is the vocabulary size. Two recent theoretical works have reframed the OTLP in terms of importance sampling or subset selection. In this work, we prove that these formulations are equivalent to an exponentially large relaxed OTLP, so it remains infeasible to solve. We then reverse-engineer subset selection to formulate the OTLP as a max-flow problem. With a novel application of polymatroid theory, we reduce the exponentially large OTLP to a convex optimization problem in at most $V$ variables. This allows us to devise an algorithm for optimal $n$-draft speculative sampling when the $n$ tokens are chosen i.i.d. from a single draft model, which can be tuned to arbitrary accuracy. Finally, we measure acceptance rates and algorithm runtimes for various $n$ and top-$k$ draft sampling settings. Our findings yield the first multi-draft algorithm with 90% acceptance and under 100 ms of overhead per generated token, with negligible deviation from the target model distribution.
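For background, the single-draft accept/resample rule that the abstract's multi-draft OT criterion generalizes can be sketched as follows. This is a minimal illustration of standard speculative sampling, not the paper's algorithm; `p`, `q`, and `speculative_step` are illustrative names, and the toy vocabulary has three tokens.

```python
# Minimal sketch of standard single-draft speculative sampling:
# accept a draft token x ~ q with probability min(1, p[x]/q[x]),
# otherwise resample from the renormalized residual max(0, p - q).
# The output token is then distributed exactly according to p.
import random

def speculative_step(p, q, rng):
    """Sample one token whose marginal distribution is exactly p."""
    V = len(p)
    # Draft model proposes a candidate token from q.
    x = rng.choices(range(V), weights=q, k=1)[0]
    # Verification: accept with probability min(1, p[x] / q[x]).
    if rng.random() < min(1.0, p[x] / q[x]):
        return x
    # Rejection: resample from the residual distribution max(0, p - q).
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(residual)
    return rng.choices(range(V), weights=residual, k=1)[0] if total > 0 else x

p = [0.5, 0.3, 0.2]   # toy target distribution
q = [0.6, 0.2, 0.2]   # toy draft distribution
rng = random.Random(0)
counts = [0, 0, 0]
for _ in range(100_000):
    counts[speculative_step(p, q, rng)] += 1
freqs = [c / 100_000 for c in counts]  # empirical output distribution ≈ p
```

For a single draft, the acceptance probability is $\sum_x \min(p_x, q_x)$ (here 0.9); the multi-draft OT criterion studied in the paper is the verification rule that maximizes this acceptance probability over $n$ drafts.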

One-sentence summary · Auto-generated by claude-haiku-4-5-20251001

Solves optimal multi-draft speculative sampling via convex optimization, achieving 90% acceptance rates.

Contributions · Auto-generated by claude-haiku-4-5-20251001
  • Proves that the importance-sampling and subset-selection formulations are equivalent to an exponentially large relaxed OTLP
  • Reverse-engineers subset selection to formulate the OTLP as a max-flow problem
  • Applies polymatroid theory to reduce the exponentially large OTLP to a convex optimization problem in at most $V$ variables
  • Achieves the first multi-draft algorithm with 90% acceptance and under 100 ms overhead per generated token
Methods used · Auto-generated by claude-haiku-4-5-20251001
  • Speculative sampling
  • Optimal transport
  • Convex optimization
  • Max-flow
Limitations (author-stated) · Auto-generated by claude-haiku-4-5-20251001

Authors did not state explicit limitations.

Future work (author-stated) · Auto-generated by claude-haiku-4-5-20251001
  • Examine efficacy in multi-step settings (from the paper)
  • Extend to non-i.i.d. draft settings (from the paper)

Author keywords

  • LLMs
  • Inference
  • Optimal Transport
  • Speculative Decoding
