Partition Generative Modeling: Masked Modeling Without Masks
Justin Deschenaux, Lan Tran, Caglar Gulcehre
We show that it is possible to train masked generative models without using MASK tokens, resulting in efficiency gains at inference.
Abstract
Masked generative models (MGMs) can generate tokens in parallel and in any order, unlike autoregressive models (ARMs), which decode one token at a time, left-to-right. However, MGMs process the full-length sequence at every sampling step, including MASK tokens that carry no information. In contrast, ARMs process only the previously generated tokens. We introduce "Partition Generative Models" (PGMs), which replace masking with partitioning. Tokens are split into two groups that cannot attend to each other, and the model learns to predict each group conditioned on the other, eliminating mask tokens entirely. Because the groups do not interact, PGMs can process only the clean tokens during sampling, like ARMs, while retaining parallel, any-order generation, like MGMs. On OpenWebText, PGMs achieve 5-5.5x higher throughput than MDLM while producing samples with lower Generative Perplexity. On ImageNet, PGMs reach comparable FID to MaskGIT with a 7.5x throughput improvement. With twice as many steps, the FID improves to 4.56 while remaining 3.9x faster than MGMs. Finally, PGMs remain compatible with existing MGM samplers and distillation methods.
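The core idea in the abstract — two token groups that cannot attend to each other — can be expressed as a block attention mask. The sketch below is one plausible reading of that constraint, not the paper's actual implementation; the function name and the same-group-only rule are assumptions for illustration:

```python
import numpy as np

def partition_attention_mask(group_ids: np.ndarray) -> np.ndarray:
    """Boolean attention mask for a partitioned sequence (illustrative).

    Position i may attend to position j only when both tokens belong
    to the same partition group, so the two groups never see each other.
    """
    return group_ids[:, None] == group_ids[None, :]

# Toy example: a 4-token sequence split into groups {0, 1}.
groups = np.array([0, 0, 1, 1])
mask = partition_attention_mask(groups)
# mask[0, 1] is True (same group); mask[0, 2] is False (cross-group).
```

In practice such a mask would be passed to the attention layers of a transformer; cross-group conditioning would then have to flow through a separate mechanism (the paper's GroupSwap layer).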
Partition Generative Models replace masking with partitioning for efficient parallel generation, achieving higher throughput than masked generative models.
- Introduces PGM approach replacing masking with token partitioning where groups cannot attend to each other
- Eliminates mask tokens entirely while retaining parallel, any-order generation capabilities like masked models
- Achieves 5-5.5x higher throughput than MDLM on OpenWebText and 7.5x improvement over MaskGIT on ImageNet
- Token partitioning
- Parallel generation
- Iterative updating
- OpenWebText
- ImageNet
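The "parallel generation" and "iterative updating" themes above follow the generic any-order decoding loop used by masked generative models: repeatedly predict and commit a subset of the remaining positions. A minimal toy sketch, assuming a hypothetical `predict_fn` stand-in for the trained model:

```python
import numpy as np

def parallel_decode(seq_len, vocab_size, num_steps, predict_fn, rng):
    """Generic any-order iterative decoding loop (toy illustration)."""
    tokens = np.full(seq_len, -1, dtype=int)   # -1 = not yet generated
    order = rng.permutation(seq_len)           # any-order schedule
    for chunk in np.array_split(order, num_steps):
        # predict_fn sees the partial sequence and fills the chosen positions;
        # a PGM would only need to process the already-clean tokens here.
        tokens[chunk] = predict_fn(tokens, chunk)
    return tokens

rng = np.random.default_rng(0)
# Dummy predictor sampling uniformly; a real model conditions on `tokens`.
dummy = lambda tokens, chunk: rng.integers(0, 50, size=len(chunk))
out = parallel_decode(16, 50, num_steps=4, predict_fn=dummy, rng=rng)
```

After `num_steps` rounds every position holds a committed token, with multiple positions filled per step rather than one at a time as in autoregressive decoding.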
Limitations
- PGMs require a slight increase in parameters to match the validation perplexity of the MDLM baseline, attributed to the GroupSwap layer (from the paper)
- Training is slightly more computationally expensive than the baseline due to torch's default attention implementation (from the paper)
- Application to multimodal settings remains an open direction (from the paper)
Future work
- Explore optimizations to the PGM architecture, including more efficient GroupSwap mechanisms (from the paper)
- Investigate distillation techniques specifically designed for PGMs (from the paper)
- Extend the approach to multimodal settings (from the paper)
Author keywords
- masked generative modeling
- discrete diffusion
- masked diffusion language modeling
- diffusion language modeling