Expert Threshold Routing for Autoregressive Language Modeling with Dynamic Computation Allocation and Load Balancing
arXiv:2603.11535v1 Announce Type: new Abstract: Token-choice Mixture-of-Experts (TC-MoE) routes each token to a fixed number of experts, limiting dynamic computation allocation and requiring auxiliary losses to maintain load balance. We propose Expert Threshold (ET) routing, where each expert maintains an exponential moving average (EMA) threshold estimated from the global token distribution. At both training and inference, each token is independently routed to an expert if its score exceeds the expert's threshold, enabling dynamic computation allocation while achieving load balance without auxiliary losses. This fully causal mechanism eliminates dependence on other tokens in the batch, making it well-suited for autoregressive language modeling. In pretraining experiments scaling to 2.4B parameters on FineWeb-Edu, ET achieves 0.067 lower cross-entropy loss than TC-MoE, equivalent to reaching the same performance with 1.6$\times$ fewer tokens.
Executive Summary
The paper introduces Expert Threshold (ET) routing as an alternative to Token-Choice Mixture-of-Experts (TC-MoE) for autoregressive language modeling. Where TC-MoE routes each token to a fixed number of experts, limiting dynamic computation allocation, and relies on auxiliary losses for load balance, ET routing gives each expert an exponential moving average (EMA) threshold estimated from the global token distribution. At both training and inference, a token is routed to an expert whenever its score exceeds that expert's threshold, yielding causal, token-independent routing that balances load without auxiliary losses. In pretraining experiments scaling to 2.4B parameters on FineWeb-Edu, ET achieves 0.067 lower cross-entropy loss than TC-MoE, equivalent to reaching the same performance with 1.6x fewer training tokens. These results point to a meaningful advance in scalable, efficient expert routing for autoregressive models.
Key Points
- ▸ ET routing replaces TC-MoE’s fixed routing with dynamic, threshold-based expert selection
- ▸ EMA thresholds are updated globally and enable causal, independent token routing
- ▸ Empirical validation shows improved efficiency and performance without auxiliary losses
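The routing rule summarized above can be sketched in a few lines. Only the per-token threshold comparison follows directly from the abstract; the paper does not spell out the EMA update, so the quantile-tracking rule and the `target_frac` and `momentum` parameters below are illustrative assumptions, not the authors' formula.

```python
import numpy as np

def et_route(scores, thresholds):
    """ET routing: each token goes to every expert whose threshold its score
    exceeds. The decision is per-token and causal, with no dependence on
    other tokens in the batch."""
    # scores: (num_tokens, num_experts) router affinities
    # thresholds: (num_experts,) per-expert EMA thresholds
    return scores > thresholds  # boolean dispatch mask; expert count varies per token

def update_thresholds(thresholds, scores, target_frac=0.125, momentum=0.99):
    """Hypothetical EMA update: track the score quantile that would admit
    roughly target_frac of tokens to each expert, so load stays balanced
    without an auxiliary loss. The paper's exact estimator may differ."""
    per_expert_quantile = np.quantile(scores, 1.0 - target_frac, axis=0)
    return momentum * thresholds + (1.0 - momentum) * per_expert_quantile

# Usage: route a batch, then nudge the thresholds toward balance.
rng = np.random.default_rng(0)
scores = rng.normal(size=(1024, 8))      # 1024 tokens, 8 experts
thresholds = np.zeros(8)
mask = et_route(scores, thresholds)      # (1024, 8) boolean routing mask
thresholds = update_thresholds(thresholds, scores)
```

With a slow EMA (`momentum` near 1), each batch only nudges the thresholds, which is what makes the rule usable unchanged at inference time.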
Merits
Performance Improvement
ET achieves 0.067 lower cross-entropy loss than TC-MoE at 2.4B-parameter scale, matching TC-MoE's performance with 1.6x fewer training tokens, indicating superior sample efficiency without any auxiliary balancing loss
Demerits
Implementation Complexity
Establishing and maintaining accurate EMA thresholds at scale may introduce computational or synchronization challenges, particularly in distributed training environments
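The synchronization concern can be made concrete: in data-parallel training, each worker sees only a shard of the token distribution, so per-expert statistics must be aggregated (e.g. via an all-reduce mean) before the EMA step, adding one collective per routed layer. The paper does not describe its distributed setup; this is a minimal single-process simulation of that assumed aggregation, not the authors' implementation.

```python
import numpy as np

def synced_ema_step(thresholds, local_stats, momentum=0.99):
    """Average per-worker statistics (standing in for an all-reduce mean),
    then apply one EMA step so every worker ends up holding identical
    thresholds. The extra collective per routed layer is the overhead
    the demerit above refers to."""
    # local_stats: (num_workers, num_experts) per-shard quantile estimates
    global_stat = local_stats.mean(axis=0)  # simulated all-reduce result
    return momentum * thresholds + (1.0 - momentum) * global_stat
```

If workers instead updated from local shards only, thresholds would drift apart and the global load-balance guarantee would weaken.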
Expert Commentary
The ET routing mechanism represents a nuanced but impactful shift in the design of Mixture-of-Experts architectures for autoregressive systems. By replacing auxiliary losses with a causal, threshold-driven routing rule, the authors decouple load balancing from any dependence on batch-level token co-occurrence. This is particularly advantageous in autoregressive modeling, where tokens are processed sequentially and inter-token dependencies are inherently causal. The use of EMA thresholds, which track global distribution statistics without per-token feedback loops, is both elegant and pragmatic. Importantly, the reported gains (0.067 loss reduction, equivalent to 1.6x token savings) suggest that ET is not merely an incremental improvement but may reset the baseline for scalable expert routing. However, the paper should address how threshold accuracy behaves under non-stationary distributions or adversarial token patterns, since drift there could affect generalizability. Overall, ET routing offers a robust, scalable alternative to TC-MoE and warrants further investigation in real-world deployments.
Recommendations
- ✓ Adopt ET routing in next-gen autoregressive models as a default routing strategy where load balancing and dynamic computation are critical
- ✓ Conduct comparative studies under non-stationary input distributions or adversarial training scenarios to evaluate threshold robustness