Locality-aware Parallel Decoding for Efficient Autoregressive Image Generation

Zhuoyang Zhang, Luke J. Huang, Chengyue Wu, Shang Yang, Kelly Peng, Yao Lu, Song Han

Vision & 3D · Fri, Apr 24 · 10:30 AM–10:40 AM · 201 A/B · Avg rating: 7.00 (6–8)

Abstract

We present Locality-aware Parallel Decoding (LPD) to accelerate autoregressive image generation. Traditional autoregressive image generation relies on next-patch prediction, a memory-bound process that leads to high latency. Existing works have tried to accelerate the process by shifting from next-patch to multi-patch prediction, but have achieved only limited parallelization. To achieve high parallelization while maintaining generation quality, we introduce two key techniques: (1) Flexible Parallelized Autoregressive Modeling, a novel architecture that enables arbitrary generation ordering and degrees of parallelization. It uses learnable position query tokens to guide generation at target positions while ensuring mutual visibility among concurrently generated tokens for consistent parallel decoding. (2) Locality-aware Generation Ordering, a novel schedule that forms groups to minimize intra-group dependencies and maximize contextual support, enhancing generation quality. With these designs, we reduce the generation steps from 256 to 20 (256×256 res.) and from 1024 to 48 (512×512 res.) without compromising quality on ImageNet class-conditional generation, and achieve at least 3.4× lower latency than previous parallelized autoregressive models.
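The "mutual visibility among concurrently generated tokens" idea can be illustrated with a group-wise attention mask. The following is only a minimal sketch under one possible reading of the abstract: each token may attend to all tokens from earlier decoding steps and to every token in its own step. The function name `group_visibility_mask` and the example group sizes are hypothetical, and the abstract does not specify how the learnable position query tokens enter the mask, so they are omitted here.

```python
def group_visibility_mask(group_ids):
    """Return mask[i][j] == True when token i may attend to token j.

    Assumption (not taken from the paper): a token sees every token
    generated in an earlier step, plus all tokens generated in its own
    step, so tokens decoded in parallel are mutually visible.
    """
    return [[gj <= gi for gj in group_ids] for gi in group_ids]


# Hypothetical 6-token sequence decoded in 3 steps of sizes 1, 2, 3.
group_ids = [0, 1, 1, 2, 2, 2]
mask = group_visibility_mask(group_ids)

for row in mask:
    # Print each row as a string of 1s (visible) and 0s (masked).
    print("".join("1" if visible else "0" for visible in row))
```

The step counts in the abstract imply fairly large groups: 256 tokens in 20 steps is 12.8 tokens per step on average, and 1024 tokens in 48 steps is about 21.3. The tiny group sizes above are chosen only to keep the printed mask readable.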

One-sentence summary · Auto-generated by claude-haiku-4-5-20251001

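Introduces Locality-aware Parallel Decoding (LPD), which combines flexible generation ordering with locality-aware grouping to cut autoregressive image generation from 256 to 20 steps at 256×256 and reach at least 3.4× lower latency than previous parallelized autoregressive models.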

Contributions · Auto-generated by claude-haiku-4-5-20251001

Flexible parallelized autoregressive modeling enabling arbitrary generation ordering and parallelization degrees
Learnable position query tokens guide generation while ensuring mutual visibility for consistent parallel decoding
Locality-aware generation ordering forming groups to minimize dependencies and maximize contextual support
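One way to picture a locality-aware ordering is a greedy grouping over the token grid. This is a hedged sketch, not the authors' schedule: the function `locality_aware_schedule`, the `min_sep` parameter, the Euclidean distance criterion, and the group sizes are all assumptions made for illustration. At each step it prefers positions close to already generated tokens (contextual support) and keeps positions chosen in the same step apart from each other (fewer intra-group dependencies), relaxing the separation only when too few candidates qualify.

```python
import math


def locality_aware_schedule(side, group_sizes, min_sep=2.0):
    """Greedily split a side x side token grid into decoding groups.

    Illustrative heuristic only: candidates are ranked by distance to the
    nearest already generated token, and a candidate joins the current
    group only if it is at least `sep` away from every group member.
    """
    remaining = {(r, c) for r in range(side) for c in range(side)}
    done = []       # positions generated in earlier steps
    schedule = []   # one list of positions per decoding step

    def dist(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])

    for size in group_sizes:
        size = min(size, len(remaining))
        # Closest-to-context first; ties broken by grid position.
        ranked = sorted(
            remaining,
            key=lambda p: (min((dist(p, q) for q in done), default=0.0), p),
        )
        group, sep = [], min_sep
        while len(group) < size:
            for p in ranked:
                if len(group) == size:
                    break
                if p not in group and all(dist(p, q) >= sep for q in group):
                    group.append(p)
            sep /= 2.0  # relax the separation if the group is still short
        schedule.append(group)
        done.extend(group)
        remaining.difference_update(group)
    return schedule


# Hypothetical 4x4 grid decoded in 5 steps with growing group sizes.
schedule = locality_aware_schedule(4, [1, 2, 3, 4, 6])
for step, group in enumerate(schedule):
    print(step, group)
```

The growing group sizes reflect the intuition that later steps have more context available and can therefore decode more tokens at once; whether the paper uses such a schedule is not stated in the abstract.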

Methods used · Auto-generated by claude-haiku-4-5-20251001

Autoregressive modeling
Parallel decoding
Diffusion models
Position embeddings
Transformer architecture

Datasets used · Auto-generated by claude-haiku-4-5-20251001

ImageNet

Limitations (author-stated) · Auto-generated by claude-haiku-4-5-20251001

Authors did not state explicit limitations.

Future work (author-stated) · Auto-generated by claude-haiku-4-5-20251001

Authors did not state explicit future directions.

Author keywords

Efficient Autoregressive Image Generation
Parallel Decoding

Related orals

Improving Diffusion Models for Class-imbalanced Training Data via Capacity Manipulation

Depth Anything 3: Recovering the Visual Space from Any Views

Text-to-3D by Stitching a Multi-view Reconstruction Network to a Video Generator

Radiometrically Consistent Gaussian Surfels for Inverse Rendering

True Self-Supervised Novel View Synthesis is Transferable