ICLR 2026 Orals

Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks

Taishi Nakamura, Satoki Ishikawa, Masaki Kawamura, Takumi Okamoto, Daisuke Nohara, Jun Suzuki, Rio Yokota

LLMs & Reasoning · Fri, Apr 24 · 10:54 AM–11:04 AM · 202 A/B · Avg rating: 6.50 (6–8)
Author-provided TL;DR

Memorization skills consistently benefit from higher sparsity, while reasoning skills require balancing active FLOPs with total tokens per parameter; the optimal point shifts with the compute budget.

Abstract

Empirical scaling laws have driven the evolution of large language models (LLMs), yet their coefficients shift whenever the model architecture or data pipeline changes. Mixture‑of‑Experts (MoE) models, now standard in state‑of‑the‑art systems, introduce a new sparsity dimension that current dense‑model frontiers overlook. We investigate how MoE sparsity influences two distinct capability regimes: memorization skills and reasoning skills. By training MoE families that vary total parameters, active parameters, and top-$k$ routing under fixed compute budgets, we disentangle pre-training loss from downstream accuracy. Our results reveal two principles. First, Active FLOPs: models with identical training loss but greater active compute achieve higher reasoning accuracy. Second, Total tokens per parameter (TPP): memorization tasks improve with more parameters, while reasoning tasks benefit from optimal TPP, indicating that reasoning is data-hungry. Neither reinforcement learning post-training (GRPO) nor increased test-time compute alters these trends. We therefore argue that optimal MoE sparsity must be determined jointly by active FLOPs and TPP, revising the classical picture of compute-optimal scaling. All code, data sources, and logs are released to facilitate reproducibility and future work.
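To make the two quantities concrete, here is a minimal sketch of how active parameters and total tokens-per-parameter (TPP) can be estimated for a Mixtral-style top-$k$ MoE. The parameter-count formulas are standard back-of-the-envelope approximations (attention projections plus SwiGLU-style expert FFNs plus embeddings), not the paper's exact accounting, and the example dimensions are illustrative assumptions rather than configurations from the paper.

```python
# Rough estimate of the two quantities the paper argues jointly determine
# optimal MoE sparsity: active compute (via active parameters) and total TPP.
# Formulas are standard approximations; dimensions below are hypothetical.

def moe_param_counts(d_model, d_ff, n_layers, n_experts, top_k, vocab=32000):
    """Approximate total and active parameter counts for a Mixtral-style decoder."""
    attn = 4 * d_model * d_model        # Q, K, V, O projections per layer
    expert = 3 * d_model * d_ff         # SwiGLU-style FFN per expert
    embed = vocab * d_model             # token embeddings (untied head ignored)
    total = n_layers * (attn + n_experts * expert) + embed
    active = n_layers * (attn + top_k * expert) + embed  # only top-k experts fire
    return total, active

def tokens_per_parameter(train_tokens, total_params):
    """Total TPP: training tokens divided by *total* (not active) parameters."""
    return train_tokens / total_params

# Hypothetical 8-expert, top-2 configuration on the paper's 125B-token corpus.
total, active = moe_param_counts(d_model=2048, d_ff=5632, n_layers=24,
                                 n_experts=8, top_k=2)
tpp = tokens_per_parameter(125e9, total)
print(f"total={total/1e9:.2f}B  active={active/1e9:.2f}B  TPP={tpp:.1f}")
```

Holding total parameters fixed while raising `top_k` increases active FLOPs per token without changing TPP, which is the axis the paper finds matters for reasoning accuracy at matched training loss.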

One-sentence summary · Auto-generated by claude-haiku-4-5-20251001

MoE sparsity investigation reveals optimal balance between active FLOPs and tokens-per-parameter for reasoning versus memorization.

Contributions · Auto-generated by claude-haiku-4-5-20251001
  • Investigates optimal sparsity for Mixture-of-Experts models under fixed compute budgets
  • Reveals two principles: Active FLOPs enhance reasoning accuracy; optimal tokens-per-parameter differs by task type
  • Shows reasoning tasks are data-hungry and peak near 20 tokens-per-parameter, while memorization benefits from sparsity
  • Demonstrates that neither GRPO post-training nor increased test-time compute alters the sparsity–performance relationships
Methods used · Auto-generated by claude-haiku-4-5-20251001
  • Mixture-of-Experts models
  • Scaling laws
  • Routing strategies
Limitations (author-stated) · Auto-generated by claude-haiku-4-5-20251001
  • All models are trained on a 125B-token corpus, which is Chinchilla-optimal for dense models; larger corpora could shift the optimal sparsity
  • The study is limited to the Mixtral architecture and does not exhaustively explore all MoE architectural choices
  • Does not explore staged SFT or curriculum learning within the SFT stage
Future work (author-stated) · Auto-generated by claude-haiku-4-5-20251001
  • Train larger models with higher tokens-per-parameter to determine whether optimal sparsity shifts toward sparser configurations

Author keywords

  • Mixture of Experts
  • memorization
  • reasoning
  • scaling laws
  • large language models
