Coupling Experts and Routers in Mixture-of-Experts via an Auxiliary Loss

Ang Lv, Jin Ma, Yiyuan Ma, Siyuan Qiao

LLMs & Reasoning · Fri, Apr 24 · 3:39 PM–3:49 PM · Amphitheater · Avg rating: 6.67 (6–8)

Abstract

Mixture-of-Experts (MoE) models lack explicit constraints to ensure the router's decisions align well with the experts' capabilities, which ultimately limits model performance. To address this, we propose expert-router coupling (ERC) loss, a lightweight auxiliary loss that tightly couples the router's decisions with expert capabilities. Our approach treats each expert's router embedding as a proxy token for the tokens assigned to that expert, and feeds perturbed router embeddings through the experts to obtain intermediate activations. The ERC loss enforces two constraints on these activations: (1) Each expert must exhibit higher activation for its own proxy token than for the proxy tokens of any other expert. (2) Each proxy token must elicit stronger activation from its corresponding expert than from any other expert. These constraints jointly ensure that each router embedding faithfully represents its corresponding expert's capability, while each expert specializes in processing the tokens actually routed to it. The ERC loss is computationally efficient, operating only on $n^2$ activations, where $n$ is the number of experts. This represents a fixed cost independent of batch size, unlike prior coupling methods that scale with the number of tokens (often millions per batch). Through pre-training MoE-LLMs ranging from 3B to 15B parameters and extensive analysis on trillions of tokens, we demonstrate the effectiveness of the ERC loss. Moreover, the ERC loss offers flexible control and quantitative tracking of expert specialization levels during training, providing valuable insights into MoEs.
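
The two constraints above can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the exact loss form, so the hinge penalty, the margin factor `alpha`, the additive Gaussian perturbation, the use of the activation norm as the "activation strength", and the names `activation_matrix` and `erc_loss` are all assumptions made for illustration. Each expert is reduced to a single first-layer weight matrix.

```python
import math
import random


def activation_matrix(router_emb, expert_w, noise_std=0.0, rng=None):
    """Build the n x n matrix M, where M[i][j] is the activation strength
    of expert j on the (perturbed) proxy token of expert i.

    router_emb: n router embeddings, each a list of d floats.
    expert_w:   n first-layer weight matrices, each hidden x d (assumption).
    """
    rng = rng or random.Random(0)
    n = len(router_emb)
    M = [[0.0] * n for _ in range(n)]
    for i, r in enumerate(router_emb):
        # Assumption: perturb the router embedding with additive Gaussian noise.
        proxy = [x + rng.gauss(0.0, noise_std) for x in r]
        for j, W in enumerate(expert_w):
            # Intermediate activation of expert j on proxy token i.
            h = [sum(w * x for w, x in zip(row, proxy)) for row in W]
            # Assumption: summarize the activation by its Euclidean norm.
            M[i][j] = math.sqrt(sum(v * v for v in h))
    return M


def erc_loss(M, alpha=1.0):
    """Hinge-style penalty (assumed form) on the n x n activation matrix."""
    n = len(M)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # Constraint (1): expert i responds more strongly to its own
            # proxy token i than to proxy token j.
            total += max(0.0, alpha * M[j][i] - M[i][i])
            # Constraint (2): proxy token i elicits a stronger response
            # from expert i than from expert j.
            total += max(0.0, alpha * M[i][j] - M[i][i])
    return total / (n * n)


# Toy example: two experts whose weights match their router embeddings.
router_emb = [[1.0, 0.0], [0.0, 1.0]]
coupled_experts = [[[1.0, 0.0]], [[0.0, 1.0]]]
swapped_experts = [[[0.0, 1.0]], [[1.0, 0.0]]]

print(erc_loss(activation_matrix(router_emb, coupled_experts)))
print(erc_loss(activation_matrix(router_emb, swapped_experts)))
```

In this sketch the penalty vanishes when every diagonal entry dominates its row and column, and it is positive when experts and router embeddings are mismatched. The matrix has $n^2$ entries regardless of how many tokens are in the batch, which is the fixed cost the abstract refers to.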

One-sentence summary · Auto-generated by claude-haiku-4-5-20251001

Expert-Router Coupling loss tightly couples MoE router decisions with expert capabilities by treating router embeddings as proxy tokens.

Contributions · Auto-generated by claude-haiku-4-5-20251001

Proposes the ERC loss, which requires each expert to exhibit higher activation for its own proxy token than for the proxy tokens of other experts
Requires each proxy token to elicit stronger activation from its corresponding expert than from any other expert
Computes the ERC loss on only $n^2$ activations, independent of batch size, unlike prior coupling methods whose cost scales with the number of tokens

Methods used · Auto-generated by claude-haiku-4-5-20251001

Expert-router coupling loss
Mixture-of-Experts
Proxy token mechanism

Limitations (author-stated) · Auto-generated by claude-haiku-4-5-20251001

Authors did not state explicit limitations.

Future work (author-stated) · Auto-generated by claude-haiku-4-5-20251001

Authors did not state explicit future directions.

Author keywords

Mixture-of-Experts
Large language models
Auxiliary loss
Expert-router coupling
Expert specialization

Related orals

Benchmarking Empirical Privacy Protection for Adaptations of Large Language Models

Half-order Fine-Tuning for Diffusion Model: A Recursive Likelihood Ratio Optimizer

Invisible Safety Threat: Malicious Finetuning for LLM via Steganography

Reducing Belief Deviation in Reinforcement Learning for Active Reasoning of LLM Agents

RefineStat: Efficient Exploration for Probabilistic Program Synthesis