On the Reasoning Abilities of Masked Diffusion Language Models
Anej Svete, Ashish Sabharwal
We prove that masked text diffusion models are equivalent to padded looped transformers, can solve all problems that chain-of-thought transformers can, and are more efficient on certain problem classes due to their parallel generation mechanism.
Abstract
Masked diffusion models (MDMs) for text offer a compelling alternative to traditional autoregressive language models. Parallel generation makes them efficient, but their computational capabilities and the limitations inherent in their parallelism remain largely unexplored. To address this, we characterize what types of reasoning problems MDMs can provably solve and how efficiently they can solve them. We do this by connecting MDMs to the well-understood reasoning frameworks of chain of thought (CoT) and padded looped transformers (PLTs) in the finite-precision log-width setting: We show that MDMs and polynomially-padded PLTs are, in fact, equivalent in this setting, and that MDMs can solve all problems that CoT-augmented transformers can. Moreover, we showcase classes of problems (including regular languages) for which MDMs are inherently more efficient than CoT transformers, where parallel generation allows for substantially faster reasoning.
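The efficiency claim in the abstract rests on the decoding step count: an autoregressive CoT model emits one token per step, while a masked diffusion model can unmask several positions per step. The toy sketch below (not the paper's construction; the "denoiser" is a stand-in rule that simply reveals target tokens, and the parameter `k` is a hypothetical per-step unmasking budget) only counts decoding steps to illustrate the gap.

```python
# Toy step-count comparison: parallel masked-diffusion decoding vs.
# autoregressive decoding. Purely illustrative; a real MDM predicts the
# masked tokens with a neural denoiser rather than copying them.

MASK = "?"

def diffusion_decode_steps(target, k):
    """Unmask up to k positions per step; return the number of steps."""
    seq = [MASK] * len(target)
    steps = 0
    while MASK in seq:
        masked = [i for i, t in enumerate(seq) if t == MASK]
        for i in masked[:k]:          # reveal up to k positions "in parallel"
            seq[i] = target[i]
        steps += 1
    return steps

def autoregressive_decode_steps(target):
    """One token per step, left to right."""
    return len(target)

target = list("abcdefgh")
print(autoregressive_decode_steps(target))      # 8 steps
print(diffusion_decode_steps(target, k=4))      # 2 steps
```

With a per-step budget of `k` positions, a length-`n` output takes roughly `n/k` diffusion steps instead of `n` autoregressive steps, which is the kind of speedup the paper formalizes for classes such as regular languages.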
Authors did not state explicit limitations.
Authors did not state explicit future directions.
Author keywords
- diffusion language models
- formal language theory
- boolean circuits
- expressivity
- transformers
- masked diffusion models
- chain of thought
- looped transformers
Related orals
Benchmarking Empirical Privacy Protection for Adaptations of Large Language Models
Benchmarks practical privacy risks in differential privacy-adapted LLMs, revealing that distribution shifts and model choice impact effectiveness.
Half-order Fine-Tuning for Diffusion Model: A Recursive Likelihood Ratio Optimizer
Proposes Recursive Likelihood Ratio optimizer for efficient fine-tuning of diffusion models with lower variance gradient estimation.
Invisible Safety Threat: Malicious Finetuning for LLM via Steganography
Demonstrates that LLMs can be finetuned to generate harmful steganographically hidden outputs while appearing benign to safety systems.
Reducing Belief Deviation in Reinforcement Learning for Active Reasoning of LLM Agents
Proposes T3 algorithm to detect belief deviation in LLM agents and truncate trajectories for improved reinforcement learning in active reasoning tasks.
RefineStat: Efficient Exploration for Probabilistic Program Synthesis
RefineStat enforces semantic constraints and applies diagnostic-aware refinement for synthesizing valid probabilistic programs from smaller language models.