NextStep-1: Toward Autoregressive Image Generation with Continuous Tokens at Scale
Chunrui Han, Guopeng Li, Jingwei Wu, Quan Sun, Yan Cai, Yuang Peng, Zheng Ge, Deyu Zhou, Haomiao Tang, Hongyu Zhou, Kenkun Liu, Shu-Tao Xia, Binxing Jiao, Daxin Jiang, Xiangyu Zhang, Yibo Zhu
Abstract
Prevailing autoregressive (AR) models for text-to-image generation either rely on heavy, computationally intensive diffusion models to process continuous image tokens, or employ vector quantization (VQ) to obtain discrete tokens at the cost of quantization loss. In this paper, we push the autoregressive paradigm forward with NextStep-1, a 14B autoregressive model paired with a 157M flow matching head, trained on discrete text tokens and continuous image tokens with next-token prediction objectives. NextStep-1 achieves state-of-the-art performance for autoregressive models in text-to-image generation tasks, exhibiting strong capabilities in high-fidelity image synthesis. Furthermore, our method shows strong performance in image editing, highlighting the power and versatility of our unified approach. To facilitate open research, we have released our code and models to the community at https://github.com/stepfun-ai/NextStep-1.
NextStep-1 achieves state-of-the-art autoregressive text-to-image generation by modeling continuous image tokens with lightweight flow matching instead of diffusion.
- Proposes a 14B autoregressive model paired with a 157M flow matching head for continuous image token generation
- Develops a robust autoencoder that uses noise perturbation and token-wise input latent normalization to handle high-dimensional continuous tokens
- Demonstrates performance competitive with diffusion-based methods in both image generation and editing
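The two ingredients above can be illustrated with a minimal numpy sketch, assuming the standard rectified-flow formulation (linear interpolation between noise and data, with the velocity as regression target) and a per-token interpretation of "token-wise input latent normalization"; the function names and the epsilon constant are illustrative, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize_tokens(latents, eps=1e-6):
    """Token-wise normalization: standardize each continuous latent token
    across its channel dimension. latents: (num_tokens, channels)."""
    mean = latents.mean(axis=-1, keepdims=True)
    std = latents.std(axis=-1, keepdims=True)
    return (latents - mean) / (std + eps)

def flow_matching_target(x1, x0, t):
    """Rectified-flow training pair for one continuous image token:
    x_t lies on the straight path from noise x0 to data x1, and the
    flow matching head is regressed onto the constant velocity x1 - x0."""
    x_t = (1.0 - t) * x0 + t * x1
    v_target = x1 - x0
    return x_t, v_target

# Toy usage: normalize a few latent tokens, then build one training pair.
tokens = rng.normal(loc=3.0, scale=2.0, size=(4, 16))
norm = normalize_tokens(tokens)
x0 = rng.normal(size=16)               # Gaussian noise sample
x_t, v = flow_matching_target(norm[0], x0, t=0.3)
```

At training time the AR backbone's hidden state would condition the head's velocity prediction, and the loss is the mean squared error against `v_target`; at sampling time the head integrates the learned velocity field from noise to a token.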
- Autoregressive generation
- Flow matching
- Next-token prediction
Limitations
- Generative artifacts emerge when transitioning to higher-dimensional latent spaces, including local noise, block-shaped artifacts, global noise, and grid-like artifacts
- The strictly sequential nature of autoregressive generation requires substantially more training steps to converge at higher resolutions than diffusion models need
- Techniques for high-resolution training are less mature than those established for diffusion models
Future directions
- Improve the efficiency of the flow matching head through parameter reduction or distillation
- Accelerate the autoregressive backbone by adapting advances such as speculative decoding or multi-token prediction
Author keywords
- Generative Models
- Autoregressive Models
- Diffusion Models
- Text-to-image
Related orals
Universal Inverse Distillation for Matching Models with Real-Data Supervision (No GANs)
RealUID provides universal distillation for matching models without GANs, incorporating real data into one-step generator training.
GLASS Flows: Efficient Inference for Reward Alignment of Flow and Diffusion Models
GLASS Flows samples Markov transitions via inner flow matching models to improve inference-time reward alignment in flow and diffusion models.
Neon: Negative Extrapolation From Self-Training Improves Image Generation
Neon inverts model degradation from self-training by extrapolating away from it, improving generative models with minimal compute.
Generative Human Geometry Distribution
Introduces distribution-over-distribution model combining geometry distributions with two-stage flow matching for human 3D generation.
Cross-Domain Lossy Compression via Rate- and Classification-Constrained Optimal Transport
Cross-domain lossy compression unifies rate and classification constraints via optimal transport framework.