Planner Aware Path Learning in Diffusion Language Models Training
Fred Zhangzhi Peng, Zachary Bezemek, Jarrid Rector-Brooks, Shuibai Zhang, Michael M. Bronstein, Anru Zhang, Joey Bose, Alexander Tong
We propose Planner Aware Path Learning (PAPL), a simple planner-aligned training method for Diffusion Language Models that resolves the training–inference mismatch and consistently improves generation quality.
Abstract
Diffusion language models have emerged as a powerful alternative to autoregressive models, enabling fast inference through more flexible and parallel generation paths. This sampling flexibility is unlocked by newly engineered sampling strategies, or *planners*, which select more favorable generation paths by iteratively planning---rather than choosing uniformly at random---where to denoise along the sequence. However, by modifying the reverse paths via planning, planners create an irrevocable mismatch between the uniformly random denoising paths seen during training and the planning-based paths used at inference. In this paper, we systematically investigate this mismatch between discrete diffusion training and planning-based inference, and we theoretically prove that the standard discrete diffusion training evidence lower bound (ELBO) does not accurately describe a denoiser that uses a non-uniform planner. To address this gap, we derive a new planned evidence lower bound (P-ELBO) that incorporates planner-based reverse dynamics directly into the training objective. Using the P-ELBO, we introduce *Planner Aware Path Learning* (PAPL), a novel training scheme that aligns training and inference under a planned denoiser. PAPL is implemented as a simple yet effective modification to the standard masked discrete diffusion loss, making it widely applicable and easy to adopt. Empirically, we show PAPL delivers consistent gains across domains, including a 40\% relative improvement in protein sequence generation, up to a $4\times$ relative MAUVE gain in text generation, and a 23\% relative improvement in HumanEval pass@10 for code generation.
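The abstract describes PAPL as a simple modification of the standard masked discrete diffusion loss that folds the planner's reverse dynamics into training. The paper's actual objective is the P-ELBO; the sketch below is only a hedged illustration of the general idea, in which a hypothetical confidence-based planner reweights the per-position masked cross-entropy in place of the uniform weighting over masked positions. All function names and the toy planner are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def masked_ce_per_position(logits, targets, mask):
    """Per-position cross-entropy at masked positions.
    logits: (L, V) array, targets: (L,) int array, mask: (L,) bool array."""
    z = logits - logits.max(axis=-1, keepdims=True)          # stable log-softmax
    logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    nll = -logp[np.arange(len(targets)), targets]
    return nll * mask                                        # zero loss at unmasked positions

def confidence_planner_weights(logits, mask):
    """Toy planner (an illustrative assumption): prefer denoising positions where
    the model is most confident. Returns a distribution over masked positions."""
    z = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    conf = probs.max(axis=-1)                                # top-1 confidence per position
    scores = np.where(mask, conf, -np.inf)                   # only masked positions compete
    w = np.exp(scores - scores.max())
    w = np.where(mask, w, 0.0)
    return w / w.sum()

def planner_weighted_loss(logits, targets, mask):
    """Planner-aware reweighting: weight the masked cross-entropy by the planner's
    selection probabilities instead of the uniform 1/|masked| weighting of the
    standard masked diffusion ELBO."""
    nll = masked_ce_per_position(logits, targets, mask)
    w = confidence_planner_weights(logits, mask)
    return float((w * nll).sum())
```

In the uniform baseline every masked position contributes equally; here positions the planner is likely to denoise next dominate the loss, which is the kind of training/inference alignment the abstract argues for.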
Limitations: Authors did not state explicit limitations.
Future work (from the paper): Investigate hybrid models combining autoregressive generation with non-autoregressive infilling within blocks.
Author keywords
- Diffusion Language Models
- Discrete Diffusion
- Diffusion Models
- code generation
- protein generation
- text generation
Related orals
Universal Inverse Distillation for Matching Models with Real-Data Supervision (No GANs)
RealUID provides universal distillation for matching models without GANs, incorporating real data into one-step generator training.
GLASS Flows: Efficient Inference for Reward Alignment of Flow and Diffusion Models
GLASS Flows samples Markov transitions via inner flow matching models to improve inference-time reward alignment in flow and diffusion models.
Neon: Negative Extrapolation From Self-Training Improves Image Generation
Neon inverts model degradation from self-training by extrapolating away from it, improving generative models with minimal compute.
Generative Human Geometry Distribution
Introduces a distribution-over-distribution model combining geometry distributions with two-stage flow matching for human 3D generation.
Cross-Domain Lossy Compression via Rate- and Classification-Constrained Optimal Transport
Cross-domain lossy compression unifies rate and classification constraints via optimal transport framework.