NextStep-1: Toward Autoregressive Image Generation with Continuous Tokens at Scale
Chunrui Han, Guopeng Li, Jingwei Wu, Quan Sun, Yan Cai, Yuang Peng, Zheng Ge, Deyu Zhou, Haomiao Tang, Hongyu Zhou, Kenkun Liu, Shu-Tao Xia, Binxing Jiao, Daxin Jiang, Xiangyu Zhang, Yibo Zhu
Abstract
Prevailing autoregressive (AR) models for text-to-image generation either rely on heavy, computationally intensive diffusion models to process continuous image tokens, or employ vector quantization (VQ) to obtain discrete tokens at the cost of quantization loss. In this paper, we push the autoregressive paradigm forward with NextStep-1, a 14B autoregressive model paired with a 157M flow matching head, trained on discrete text tokens and continuous image tokens with next-token prediction objectives. NextStep-1 achieves state-of-the-art performance for autoregressive models in text-to-image generation tasks, exhibiting strong capabilities in high-fidelity image synthesis. Furthermore, our method shows strong performance in image editing, highlighting the power and versatility of our unified approach. To facilitate open research, we have released our code and models to the community at https://github.com/stepfun-ai/NextStep-1.
NextStep-1 achieves state-of-the-art autoregressive text-to-image generation by modeling continuous image tokens with lightweight flow matching instead of diffusion.
- Proposes a 14B autoregressive model paired with a 157M flow matching head for continuous image token generation
- Develops a robust autoencoder that uses noise perturbation and token-wise input latent normalization to handle high-dimensional continuous tokens
- Demonstrates performance competitive with diffusion-based methods in both image generation and editing
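The two ingredients above can be illustrated with a minimal numpy sketch, assuming the standard rectified-flow formulation (linear interpolation between noise and data, with the velocity as regression target) and a per-token interpretation of "token-wise input latent normalization"; the function names and the epsilon constant are illustrative, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize_tokens(latents, eps=1e-6):
    """Token-wise normalization: standardize each continuous latent token
    across its channel dimension. latents: (num_tokens, channels)."""
    mean = latents.mean(axis=-1, keepdims=True)
    std = latents.std(axis=-1, keepdims=True)
    return (latents - mean) / (std + eps)

def flow_matching_target(x1, x0, t):
    """Rectified-flow training pair for one continuous image token:
    x_t lies on the straight path from noise x0 to data x1, and the
    flow matching head is regressed onto the constant velocity x1 - x0."""
    x_t = (1.0 - t) * x0 + t * x1
    v_target = x1 - x0
    return x_t, v_target

# Toy usage: normalize a few latent tokens, then build one training pair.
tokens = rng.normal(loc=3.0, scale=2.0, size=(4, 16))
norm = normalize_tokens(tokens)
x0 = rng.normal(size=16)               # Gaussian noise sample
x_t, v = flow_matching_target(norm[0], x0, t=0.3)
```

At training time the AR backbone's hidden state would condition the head's velocity prediction, and the loss is the mean squared error against `v_target`; at sampling time the head integrates the learned velocity field from noise to a token.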
- Autoregressive generation
- Flow matching
- Next-token prediction
Limitations
- Generative artifacts emerge when transitioning to higher-dimensional latent spaces, including local noise, block-shaped artifacts, global noise, and grid-like artifacts
- The strictly sequential nature of autoregressive generation requires substantially more training steps to converge at higher resolutions than diffusion models need
- Techniques for high-resolution training are less mature than those established for diffusion models
Future directions
- Improve the efficiency of the flow matching head through parameter reduction or distillation
- Accelerate the autoregressive backbone by adapting advances such as speculative decoding or multi-token prediction
Author keywords
- Generative Models
- Autoregressive Models
- Diffusion Models
- Text-to-image
Related orals
Universal Inverse Distillation for Matching Models with Real-Data Supervision (No GANs)
RealUID provides universal distillation for matching models without GANs, incorporating real data into one-step generator training.
GLASS Flows: Efficient Inference for Reward Alignment of Flow and Diffusion Models
GLASS Flows samples Markov transitions via inner flow matching models to improve inference-time reward alignment in flow and diffusion models.
Neon: Negative Extrapolation From Self-Training Improves Image Generation
Neon inverts model degradation from self-training by extrapolating away from it, improving generative models with minimal compute.
Generative Human Geometry Distribution
Introduces distribution-over-distribution model combining geometry distributions with two-stage flow matching for human 3D generation.
Cross-Domain Lossy Compression via Rate- and Classification-Constrained Optimal Transport
Cross-domain lossy compression unifies rate and classification constraints via optimal transport framework.