ICLR 2026 Orals

On the Reasoning Abilities of Masked Diffusion Language Models

Anej Svete, Ashish Sabharwal

LLMs & Reasoning · Sat, Apr 25 · 10:42 AM–10:52 AM · Amphitheater · Avg rating: 7.00 (6–8)
Author-provided TL;DR

We prove that masked text diffusion models are equivalent to padded looped transformers, can solve all problems that chain-of-thought transformers can, and are more efficient on certain problem classes due to their parallel generation mechanism.

Abstract

Masked diffusion models (MDMs) for text offer a compelling alternative to traditional autoregressive language models. Parallel generation makes them efficient, but their computational capabilities and the limitations inherent in their parallelism remain largely unexplored. To this end, we characterize what types of reasoning problems MDMs can provably solve and how efficiently. We do this by connecting MDMs to the well-understood reasoning frameworks of chain of thought (CoT) and padded looped transformers (PLTs) in the finite-precision log-width setting: We show that MDMs and polynomially-padded PLTs are, in fact, equivalent in this setting, and that MDMs can solve all problems that CoT-augmented transformers can. Moreover, we showcase classes of problems (including regular languages) for which MDMs are inherently more efficient than CoT transformers, where parallel generation allows for substantially faster reasoning.
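The efficiency claim rests on MDMs committing many tokens per denoising step rather than one token per forward pass. As a rough illustration only (not the paper's construction), the toy Python sketch below contrasts the two decoding regimes; `predict` is a hypothetical stand-in for a trained denoiser that fills every masked position in parallel.

```python
import math
import random

# Hypothetical toy setup: a two-symbol vocabulary and a mask token.
VOCAB = ["a", "b"]
MASK = "<mask>"

def predict(seq):
    """Placeholder denoiser: proposes a token for every masked slot at once."""
    return [random.choice(VOCAB) if tok == MASK else tok for tok in seq]

def cot_decode(length):
    """Autoregressive / CoT-style decoding: one committed token per model
    call, so `length` tokens cost `length` sequential steps."""
    seq = []
    for _ in range(length):
        seq.append(predict(seq + [MASK])[-1])
    return seq

def mdm_decode(length, steps):
    """MDM-style decoding: start fully masked and commit a batch of
    positions per call, so only `steps` sequential calls are needed."""
    seq = [MASK] * length
    per_step = math.ceil(length / steps)
    for _ in range(steps):
        proposal = predict(seq)
        masked = [i for i, tok in enumerate(seq) if tok == MASK]
        for i in masked[:per_step]:  # commit several positions in parallel
            seq[i] = proposal[i]
    return seq

print(cot_decode(8))      # 8 sequential model calls
print(mdm_decode(8, 2))   # 2 sequential model calls, 4 tokens each
```

In this sketch both decoders make the same number of token predictions, but the masked-diffusion loop needs far fewer sequential model calls, which is the source of the speedups the paper proves for certain problem classes.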

One-sentence summary·Auto-generated by claude-haiku-4-5-20251001

Masked diffusion language models are proven equivalent to padded looped transformers, able to solve every problem that chain-of-thought transformers can, and more efficient on certain problem classes thanks to parallel generation.

Contributions·Auto-generated by claude-haiku-4-5-20251001
  • Proves that masked diffusion models and polynomially-padded looped transformers are equivalent in the finite-precision log-width setting
  • Shows that MDMs can solve every problem solvable by chain-of-thought-augmented transformers
  • Identifies problem classes, including regular languages, where parallel generation makes MDMs provably more efficient than CoT transformers
Methods used·Auto-generated by claude-haiku-4-5-20251001
  • Masked diffusion models
  • Formal language theory
  • Boolean circuit expressivity analysis
  • Reductions between chain of thought and padded looped transformers
Limitations (author-stated)·Auto-generated by claude-haiku-4-5-20251001

Authors did not state explicit limitations.

Future work (author-stated)·Auto-generated by claude-haiku-4-5-20251001

Authors did not state explicit future directions.

Author keywords

  • diffusion language models
  • formal language theory
  • boolean circuits
  • expressivity
  • transformers
  • masked diffusion models
  • chain of thought
  • looped transformers
