Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks
Taishi Nakamura, Satoki Ishikawa, Masaki Kawamura, Takumi Okamoto, Daisuke Nohara, Jun Suzuki, Rio Yokota
Memorization skills consistently benefit from higher sparsity, while reasoning skills require balancing active FLOPs with total tokens per parameter; the optimal point shifts with the compute budget.
Abstract
Empirical scaling laws have driven the evolution of large language models (LLMs), yet their coefficients shift whenever the model architecture or data pipeline changes. Mixture‑of‑Experts (MoE) models, now standard in state‑of‑the‑art systems, introduce a new sparsity dimension that current dense‑model frontiers overlook. We investigate how MoE sparsity influences two distinct capability regimes: memorization skills and reasoning skills. By training MoE families that vary total parameters, active parameters, and top-$k$ routing under fixed compute budgets, we disentangle pre-training loss from downstream accuracy. Our results reveal two principles. First, Active FLOPs: models with identical training loss but greater active compute achieve higher reasoning accuracy. Second, Total tokens per parameter (TPP): memorization tasks improve with more parameters, while reasoning tasks benefit from optimal TPP, indicating that reasoning is data-hungry. Neither reinforcement learning post-training (GRPO) nor increased test-time compute alters these trends. We therefore argue that optimal MoE sparsity must be determined jointly by active FLOPs and TPP, revising the classical picture of compute-optimal scaling. All code, data sources, and logs are released to facilitate reproducibility and future work.
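The quantities the abstract compares (total vs. active parameters under top-$k$ routing, and tokens per parameter) can be illustrated with a toy sizing calculation. This is a hedged sketch: the model dimensions and the parameter-counting formula below are hypothetical simplifications (attention plus gated expert FFNs only, embeddings omitted), not the paper's exact accounting; only the 125B-token corpus size comes from the paper.

```python
# Illustrative MoE sizing arithmetic; all dimensions are hypothetical.

def moe_param_counts(d_model, d_ff, n_layers, n_experts, top_k):
    """Rough parameter counts for a Mixtral-style MoE transformer,
    counting only attention and expert FFN weights (embeddings omitted)."""
    attn = 4 * d_model * d_model        # Q, K, V, O projections
    expert = 3 * d_model * d_ff         # gated FFN: up, gate, down
    total = n_layers * (attn + n_experts * expert)   # all experts stored
    active = n_layers * (attn + top_k * expert)      # only top-k fire
    return total, active

def tokens_per_parameter(train_tokens, total_params):
    """Total TPP, the data-to-capacity ratio varied in the study."""
    return train_tokens / total_params

# Hypothetical configuration: 8 experts, top-2 routing.
total, active = moe_param_counts(d_model=1024, d_ff=4096, n_layers=16,
                                 n_experts=8, top_k=2)
tpp = tokens_per_parameter(125e9, total)  # 125B-token corpus (from the paper)
print(total, active, round(tpp, 1))
```

Holding the training budget fixed, raising the expert count at fixed top-$k$ grows `total` (lowering TPP) without changing `active`, which is the sparsity trade-off the paper studies.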
MoE sparsity investigation reveals optimal balance between active FLOPs and tokens-per-parameter for reasoning versus memorization.
- Investigates optimal sparsity for Mixture-of-Experts models under fixed compute budgets
- Reveals two principles: Active FLOPs enhance reasoning accuracy; optimal tokens-per-parameter differs by task type
- Shows reasoning tasks are data-hungry and peak near 20 tokens-per-parameter, while memorization benefits from sparsity
- Demonstrates that neither GRPO post-training nor increased test-time compute alters the sparsity-performance relationships
- Mixture-of-Experts models
- Scaling laws
- Routing strategies
Limitations
- All models were trained on a 125B-token corpus, which is Chinchilla-optimal for dense models; larger corpora could shift the optimal sparsity (from the paper)
- The study is limited to the Mixtral architecture and does not exhaustively explore all MoE architectural choices (from the paper)
- Staged SFT and curriculum learning within the SFT regime are not explored (from the paper)
Future work
- Train larger models with higher tokens-per-parameter to determine whether the optimal sparsity shifts toward sparser configurations (from the paper)
Author keywords
- Mixture of Experts
- memorization
- reasoning
- scaling laws
- large language models
Related orals
Benchmarking Empirical Privacy Protection for Adaptations of Large Language Models
Benchmarks practical privacy risks in differential privacy-adapted LLMs, revealing distribution shifts and model choice impact effectiveness.
Half-order Fine-Tuning for Diffusion Model: A Recursive Likelihood Ratio Optimizer
Proposes Recursive Likelihood Ratio optimizer for efficient fine-tuning of diffusion models with lower variance gradient estimation.
Invisible Safety Threat: Malicious Finetuning for LLM via Steganography
Demonstrates LLMs can be finetuned to generate harmful steganographically-hidden outputs while appearing benign to safety systems.
Reducing Belief Deviation in Reinforcement Learning for Active Reasoning of LLM Agents
Proposes T3 algorithm to detect belief deviation in LLM agents and truncate trajectories for improved reinforcement learning in active reasoning tasks.
RefineStat: Efficient Exploration for Probabilistic Program Synthesis
RefineStat enforces semantic constraints and applies diagnostic-aware refinement for synthesizing valid probabilistic programs from smaller language models.