Overcoming Joint Intractability with Lossless Hierarchical Speculative Decoding
Yuxuan Zhou, Fei Huang, Heng Li, Fengyi Wu, Tianyu Wang, jianwei zhang, Junyang Lin, Zhi-Qi Cheng
Abstract
Verification is a key bottleneck in improving inference speed while maintaining distribution fidelity in Speculative Decoding. Recent work has shown that sequence-level verification leads to a higher number of accepted tokens compared to token-wise verification. However, existing solutions often rely on surrogate approximations or are constrained by partial information, struggling with joint intractability. In this work, we propose \emph{Hierarchical Speculative Decoding (HSD)}, a provably lossless verification method that significantly boosts the expected number of accepted tokens and overcomes joint intractability by balancing excess and deficient probability mass across accessible branches. Our extensive large-scale experiments demonstrate that HSD yields consistent improvements in acceptance rates across diverse model families and benchmarks. Moreover, its strong explainability and generality make it readily integrable into a wide range of speculative decoding frameworks. Notably, integrating HSD into EAGLE-3 yields over a 12\% performance gain, establishing state-of-the-art decoding efficiency without compromising distribution fidelity. Code is available at https://github.com/ZhouYuxuanYX/Hierarchical-Speculative-Decoding.
Hierarchical Speculative Decoding uses lossless verification to maximize accepted tokens while preserving target distribution fidelity.
- Proposes HSD, provably lossless verification method balancing excess and deficient probability mass across branches
- Overcomes joint intractability by hierarchical approach avoiding surrogate approximations
- Demonstrates 12% performance gain when integrated with EAGLE-3 without compromising distribution fidelity
- Speculative decoding
- Hierarchical verification
- Probability distribution balancing
Authors did not state explicit limitations.
Authors did not state explicit future directions.
Author keywords
- Speculative Decoding
- Joint Intractability
- Lossless Verification
Related orals
TileLang: Bridge Programmability and Performance in Modern Neural Kernels
TileLang enables hardware-aware fused kernel programming with tile inference and recommendation achieving 5-6x speedup.
Probabilistic Kernel Function for Fast Angle Testing
Proposes probabilistic kernel functions for angle testing enabling efficient approximate nearest neighbor search.
SANA-Video: Efficient Video Generation with Block Linear Diffusion Transformer
Generates minute-long high-resolution videos efficiently with linear attention and constant-memory KV cache for block autoregression.
Efficient Resource-Constrained Training of Transformers via Subspace Optimization
WASI applies subspace-based training to transformer models reducing memory by 62x and FLOPs by 2x while maintaining accuracy on edge devices.
Why Low-Precision Transformer Training Fails: An Analysis on Flash Attention
Analyzes low-precision flash attention training failure caused by low-rank representations and biased BF16 rounding errors.