On the Reasoning Abilities of Masked Diffusion Language Models
Anej Svete, Ashish Sabharwal
We prove that masked text diffusion models are equivalent to padded looped transformers, can solve all problems that chain-of-thought transformers can, and are more efficient on certain problem classes due to their parallel generation mechanism.
Abstract
Masked diffusion models (MDMs) for text offer a compelling alternative to traditional autoregressive language models. Parallel generation makes them efficient, but their computational capabilities and the limitations inherent in their parallelism remain largely unexplored. To address this, we characterize what types of reasoning problems MDMs can provably solve and how efficiently they can solve them. We do this by connecting MDMs to the well-understood reasoning frameworks of chain of thought (CoT) and padded looped transformers (PLTs) in the finite-precision log-width setting: We show that MDMs and polynomially-padded PLTs are, in fact, equivalent in this setting, and that MDMs can solve all problems that CoT-augmented transformers can. Moreover, we showcase classes of problems (including regular languages) for which MDMs are inherently more efficient than CoT transformers, where parallel generation allows for substantially faster reasoning.
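The efficiency claim in the abstract rests on the decoding step count: an autoregressive CoT model emits one token per step, while a masked diffusion model can unmask several positions per step. The toy sketch below (not the paper's construction; the "denoiser" is a stand-in rule that simply reveals target tokens, and the parameter `k` is a hypothetical per-step unmasking budget) only counts decoding steps to illustrate the gap.

```python
# Toy step-count comparison: parallel masked-diffusion decoding vs.
# autoregressive decoding. Purely illustrative; a real MDM predicts the
# masked tokens with a neural denoiser rather than copying them.

MASK = "?"

def diffusion_decode_steps(target, k):
    """Unmask up to k positions per step; return the number of steps."""
    seq = [MASK] * len(target)
    steps = 0
    while MASK in seq:
        masked = [i for i, t in enumerate(seq) if t == MASK]
        for i in masked[:k]:          # reveal up to k positions "in parallel"
            seq[i] = target[i]
        steps += 1
    return steps

def autoregressive_decode_steps(target):
    """One token per step, left to right."""
    return len(target)

target = list("abcdefgh")
print(autoregressive_decode_steps(target))      # 8 steps
print(diffusion_decode_steps(target, k=4))      # 2 steps
```

With a per-step budget of `k` positions, a length-`n` output takes roughly `n/k` diffusion steps instead of `n` autoregressive steps, which is the kind of speedup the paper formalizes for classes such as regular languages.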
Authors did not state explicit limitations.
Authors did not state explicit future directions.
Author keywords
- diffusion language models
- formal language theory
- boolean circuits
- expressivity
- transformers
- masked diffusion models
- chain of thought
- looped transformers
Related orals
Benchmarking Empirical Privacy Protection for Adaptations of Large Language Models
Benchmarks practical privacy risks in differential privacy-adapted LLMs, revealing that distribution shifts and model choice impact effectiveness.
Half-order Fine-Tuning for Diffusion Model: A Recursive Likelihood Ratio Optimizer
Proposes Recursive Likelihood Ratio optimizer for efficient fine-tuning of diffusion models with lower variance gradient estimation.
Invisible Safety Threat: Malicious Finetuning for LLM via Steganography
Demonstrates that LLMs can be finetuned to generate harmful steganographically hidden outputs while appearing benign to safety systems.
Reducing Belief Deviation in Reinforcement Learning for Active Reasoning of LLM Agents
Proposes T3 algorithm to detect belief deviation in LLM agents and truncate trajectories for improved reinforcement learning in active reasoning tasks.
RefineStat: Efficient Exploration for Probabilistic Program Synthesis
RefineStat enforces semantic constraints and applies diagnostic-aware refinement for synthesizing valid probabilistic programs from smaller language models.