Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks
Taishi Nakamura, Satoki Ishikawa, Masaki Kawamura, Takumi Okamoto, Daisuke Nohara, Jun Suzuki, Rio Yokota
Memorization skills consistently benefit from higher sparsity, while reasoning skills require balancing active FLOPs with total tokens per parameter; the optimal point shifts with the compute budget.
Abstract
Empirical scaling laws have driven the evolution of large language models (LLMs), yet their coefficients shift whenever the model architecture or data pipeline changes. Mixture‑of‑Experts (MoE) models, now standard in state‑of‑the‑art systems, introduce a new sparsity dimension that current dense‑model frontiers overlook. We investigate how MoE sparsity influences two distinct capability regimes: memorization skills and reasoning skills. By training MoE families that vary total parameters, active parameters, and top-$k$ routing under fixed compute budgets, we disentangle pre-training loss from downstream accuracy. Our results reveal two principles. First, Active FLOPs: models with identical training loss but greater active compute achieve higher reasoning accuracy. Second, Total tokens per parameter (TPP): memorization tasks improve with more parameters, while reasoning tasks benefit from optimal TPP, indicating that reasoning is data-hungry. Neither reinforcement learning post-training (GRPO) nor increased test-time compute alters these trends. We therefore argue that optimal MoE sparsity must be determined jointly by active FLOPs and TPP, revising the classical picture of compute-optimal scaling. All code, data sources, and logs are released to facilitate reproducibility and future work.
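The quantities the abstract compares (total vs. active parameters under top-$k$ routing, and tokens per parameter) can be illustrated with a toy sizing calculation. This is a hedged sketch: the model dimensions and the parameter-counting formula below are hypothetical simplifications (attention plus gated expert FFNs only, embeddings omitted), not the paper's exact accounting; only the 125B-token corpus size comes from the paper.

```python
# Illustrative MoE sizing arithmetic; all dimensions are hypothetical.

def moe_param_counts(d_model, d_ff, n_layers, n_experts, top_k):
    """Rough parameter counts for a Mixtral-style MoE transformer,
    counting only attention and expert FFN weights (embeddings omitted)."""
    attn = 4 * d_model * d_model        # Q, K, V, O projections
    expert = 3 * d_model * d_ff         # gated FFN: up, gate, down
    total = n_layers * (attn + n_experts * expert)   # all experts stored
    active = n_layers * (attn + top_k * expert)      # only top-k fire
    return total, active

def tokens_per_parameter(train_tokens, total_params):
    """Total TPP, the data-to-capacity ratio varied in the study."""
    return train_tokens / total_params

# Hypothetical configuration: 8 experts, top-2 routing.
total, active = moe_param_counts(d_model=1024, d_ff=4096, n_layers=16,
                                 n_experts=8, top_k=2)
tpp = tokens_per_parameter(125e9, total)  # 125B-token corpus (from the paper)
print(total, active, round(tpp, 1))
```

Holding the training budget fixed, raising the expert count at fixed top-$k$ grows `total` (lowering TPP) without changing `active`, which is the sparsity trade-off the paper studies.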
MoE sparsity investigation reveals optimal balance between active FLOPs and tokens-per-parameter for reasoning versus memorization.
- Investigates optimal sparsity for Mixture-of-Experts models under fixed compute budgets
- Reveals two principles: Active FLOPs enhance reasoning accuracy; optimal tokens-per-parameter differs by task type
- Shows reasoning tasks are data-hungry and peak near 20 tokens-per-parameter, while memorization benefits from sparsity
- Demonstrates that neither GRPO post-training nor increased test-time compute alters the sparsity-performance relationships
- Mixture-of-Experts models
- Scaling laws
- Routing strategies
Limitations
- All models were trained on a 125B-token corpus, which is Chinchilla-optimal for dense models; larger corpora could shift the optimal sparsity (from the paper)
- The study is limited to the Mixtral architecture and does not exhaustively explore all MoE architectural choices (from the paper)
- Staged SFT and curriculum learning within the SFT regime are not explored (from the paper)
Future work
- Train larger models with higher tokens-per-parameter to determine whether the optimal sparsity shifts toward sparser configurations (from the paper)
Author keywords
- Mixture of Experts
- memorization
- reasoning
- scaling laws
- large language models
Related orals
Benchmarking Empirical Privacy Protection for Adaptations of Large Language Models
Benchmarks practical privacy risks in differential privacy-adapted LLMs, revealing distribution shifts and model choice impact effectiveness.
Half-order Fine-Tuning for Diffusion Model: A Recursive Likelihood Ratio Optimizer
Proposes Recursive Likelihood Ratio optimizer for efficient fine-tuning of diffusion models with lower variance gradient estimation.
Invisible Safety Threat: Malicious Finetuning for LLM via Steganography
Demonstrates LLMs can be finetuned to generate harmful steganographically-hidden outputs while appearing benign to safety systems.
Reducing Belief Deviation in Reinforcement Learning for Active Reasoning of LLM Agents
Proposes T3 algorithm to detect belief deviation in LLM agents and truncate trajectories for improved reinforcement learning in active reasoning tasks.
RefineStat: Efficient Exploration for Probabilistic Program Synthesis
RefineStat enforces semantic constraints and applies diagnostic-aware refinement for synthesizing valid probabilistic programs from smaller language models.