ICLR 2026 Orals

Partition Generative Modeling: Masked Modeling Without Masks

Justin Deschenaux, Lan Tran, Caglar Gulcehre

LLMs & Reasoning · Fri, Apr 24 · 10:54–11:04 AM · 201 A/B · Avg rating: 7.00 (range 6–8)
Author-provided TL;DR

We show that it is possible to train masked generative models without using MASK tokens, resulting in efficiency gains at inference.

Abstract

Masked generative models (MGMs) can generate tokens in parallel and in any order, unlike autoregressive models (ARMs), which decode one token at a time, left-to-right. However, MGMs process the full-length sequence at every sampling step, including MASK tokens that carry no information. In contrast, ARMs process only the previously generated tokens. We introduce "Partition Generative Models" (PGMs), which replace masking with partitioning. Tokens are split into two groups that cannot attend to each other, and the model learns to predict each group conditioned on the other, eliminating mask tokens entirely. Because the groups do not interact, PGMs can process only the clean tokens during sampling, like ARMs, while retaining parallel, any-order generation, like MGMs. On OpenWebText, PGMs achieve 5–5.5× higher throughput than MDLM while producing samples with lower Generative Perplexity. On ImageNet, PGMs reach comparable FID to MaskGIT with a 7.5× throughput improvement. With twice as many steps, the FID improves to 4.56 while remaining 3.9× faster than MGMs. Finally, PGMs remain compatible with existing MGM samplers and distillation methods.
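As an illustrative sketch only (not the authors' implementation), the partition constraint described in the abstract — two token groups that cannot attend to each other — could be expressed as a block attention mask built from per-token group labels. The function name and the interleaved group assignment below are hypothetical:

```python
import numpy as np

def partition_attention_mask(groups: np.ndarray) -> np.ndarray:
    """Boolean attention mask in which position i may attend to position j
    only if both tokens carry the same partition-group label.

    `groups` is a length-L array of 0/1 labels assigning each token to one
    of the two partitions. This is one plausible reading of the paper's
    partitioning idea, sketched for illustration.
    """
    # Broadcasting compares every pair of labels; True = attention allowed.
    return groups[:, None] == groups[None, :]

# Example: a length-6 sequence split into two interleaved groups.
groups = np.array([0, 1, 0, 0, 1, 1])
mask = partition_attention_mask(groups)
# Tokens in group 0 (positions 0, 2, 3) attend only to one another,
# as do tokens in group 1 (positions 1, 4, 5).
```

With such a mask, each group's predictions can only be informed by the other group through a dedicated exchange mechanism (the paper's GroupSwap layer), which is what lets sampling skip the masked positions entirely.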

One-sentence summary · Auto-generated by claude-haiku-4-5-20251001

Partition Generative Models replace masking with partitioning for efficient parallel generation, achieving higher throughput than masked generative models.

Contributions · Auto-generated by claude-haiku-4-5-20251001
  • Introduces PGM approach replacing masking with token partitioning where groups cannot attend to each other
  • Eliminates mask tokens entirely while retaining parallel, any-order generation capabilities like masked models
  • Achieves 5-5.5x higher throughput than MDLM on OpenWebText and 7.5x improvement over MaskGIT on ImageNet
Methods used · Auto-generated by claude-haiku-4-5-20251001
  • Token partitioning
  • Parallel generation
  • Iterative updating
Datasets used · Auto-generated by claude-haiku-4-5-20251001
  • OpenWebText
  • ImageNet
Limitations (author-stated) · Auto-generated by claude-haiku-4-5-20251001
  • Models require a slight increase in parameters to match the validation perplexity of the MDLM baseline, attributed to the GroupSwap layer
  • Training is slightly more computationally expensive than the baseline due to PyTorch's default attention implementation
  • Application to multimodal settings remains an open direction
Future work (author-stated) · Auto-generated by claude-haiku-4-5-20251001
  • Explore optimizations to the PGM architecture, including more efficient GroupSwap mechanisms
  • Investigate distillation techniques specifically designed for PGMs
  • Extend the approach to multimodal settings

Author keywords

  • masked generative modeling
  • discrete diffusion
  • masked diffusion language modeling
  • diffusion language modeling
