ICLR 2026 Orals

NextStep-1: Toward Autoregressive Image Generation with Continuous Tokens at Scale

Chunrui Han, Guopeng Li, Jingwei Wu, Quan Sun, Yan Cai, Yuang Peng, Zheng Ge, Deyu Zhou, Haomiao Tang, Hongyu Zhou, Kenkun Liu, Shu-Tao Xia, Binxing Jiao, Daxin Jiang, Xiangyu Zhang, Yibo Zhu

Diffusion & Flow Matching · Fri, Apr 24 · 11:06 AM–11:16 AM · 201 A/B · Avg rating: 4.50 (2–6)

Abstract

Prevailing autoregressive (AR) models for text-to-image generation either rely on heavy, computationally intensive diffusion models to process continuous image tokens, or employ vector quantization (VQ) to obtain discrete tokens at the cost of quantization loss. In this paper, we push the autoregressive paradigm forward with NextStep-1, a 14B autoregressive model paired with a 157M flow matching head, trained on discrete text tokens and continuous image tokens with a next-token prediction objective. NextStep-1 achieves state-of-the-art performance among autoregressive models on text-to-image generation tasks, exhibiting strong capabilities in high-fidelity image synthesis. Furthermore, our method shows strong performance in image editing, highlighting the power and versatility of our unified approach. To facilitate open research, we have released our code and models to the community at https://github.com/stepfun-ai/NextStep-1.
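The core idea in the abstract — an AR backbone emits a conditioning vector per position, and a small flow-matching head turns that vector into a continuous image token by integrating a learned velocity field from noise to data — can be sketched in miniature. Everything below is a toy illustration, not the paper's implementation: the dimensions, the MLP velocity field, and the Euler integrator are all stand-ins (the real model pairs a 14B backbone with a 157M head trained on VAE latents).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions, purely illustrative (the paper uses a 14B backbone
# and a 157M flow-matching head over much larger latents).
D_HIDDEN = 32   # backbone hidden size
D_LATENT = 8    # continuous image-token (latent) dimension
STEPS = 20      # Euler integration steps

# Hypothetical tiny MLP velocity field v(x_t, t | h) with random,
# untrained weights; the real head is trained with a flow-matching loss.
W1 = rng.normal(0, 0.1, (D_LATENT + 1 + D_HIDDEN, 64))
W2 = rng.normal(0, 0.1, (64, D_LATENT))

def velocity(x_t, t, h):
    """Predicted velocity at latent x_t and time t, conditioned on h."""
    inp = np.concatenate([x_t, [t], h])
    return np.tanh(inp @ W1) @ W2

def sample_next_token(h):
    """Sample one continuous image token by integrating the flow ODE
    dx/dt = v(x, t | h) from Gaussian noise (t=0) toward data (t=1)."""
    x = rng.normal(size=D_LATENT)
    dt = 1.0 / STEPS
    for i in range(STEPS):
        x = x + dt * velocity(x, i * dt, h)
    return x

# In the full model, h would be the backbone's hidden state at the
# current position; here it is random.
h = rng.normal(size=D_HIDDEN)
token = sample_next_token(h)
```

The sampled token would be fed back into the backbone for the next autoregressive step, so image generation stays strictly next-token-prediction while the head handles the continuous distribution.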

One-sentence summary · Auto-generated by claude-haiku-4-5-20251001

NextStep-1 achieves state-of-the-art autoregressive text-to-image generation by modeling continuous image tokens with lightweight flow matching instead of diffusion.

Contributions
  • Proposes 14B autoregressive model paired with 157M flow matching head for continuous image token generation
  • Develops robust autoencoder using noise perturbation and token-wise input latent normalization to handle high-dimensional continuous tokens
  • Demonstrates competitive performance in both image generation and editing compared to diffusion-based methods
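The second contribution — stabilizing high-dimensional continuous tokens via noise perturbation and token-wise input latent normalization — can be sketched as two small transforms applied to the latent tokens. This is a hedged guess at the general form; the function names, the per-channel statistics, and the `sigma` value are assumptions, not the paper's exact recipe.

```python
import numpy as np

rng = np.random.default_rng(0)

def tokenwise_normalize(latents, eps=1e-6):
    """Normalize each token to zero mean / unit variance across its
    channel dimension (assumed form of token-wise normalization)."""
    mean = latents.mean(axis=-1, keepdims=True)
    std = latents.std(axis=-1, keepdims=True)
    return (latents - mean) / (std + eps)

def noise_perturb(latents, sigma=0.1):
    """Add Gaussian noise to latents, a common trick for making an
    autoencoder's latent space more robust; sigma is an assumed value."""
    return latents + sigma * rng.normal(size=latents.shape)

tokens = rng.normal(3.0, 2.0, size=(16, 8))   # 16 tokens, 8 channels
normed = tokenwise_normalize(tokens)
robust_in = noise_perturb(normed)
```

Normalizing per token keeps the flow-matching head's input distribution stable regardless of how the autoencoder scales individual channels, while the noise perturbation discourages the decoder from depending on brittle fine-grained latent structure.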
Methods used
  • Autoregressive generation
  • Flow matching
  • Next-token prediction
Limitations (author-stated)
  • Generative artifacts (local noise, block-shaped artifacts, global noise, and grid-like artifacts) emerge when transitioning to higher-dimensional latent spaces
  • Strictly sequential nature of autoregressive generation requires substantially more training steps to converge at higher resolutions compared to diffusion models
  • Techniques for high-resolution training are less mature than those established for diffusion models
Future work (author-stated)
  • Improve efficiency of flow matching head through parameter reduction or distillation
  • Accelerate autoregressive backbone by adapting advances like speculative decoding or multi-token prediction

Author keywords

  • Generative Models
  • Autoregressive Models
  • Diffusion Models
  • Text-to-image

Related orals

Generative Human Geometry Distribution

Introduces a distribution-over-distribution model that combines geometry distributions with two-stage flow matching for 3D human generation.

Avg rating: 5.50 (2–8) · Xiangjun Tang et al.