
Expert-Choice Routing Enables Adaptive Computation in Diffusion Language Models


arXiv:2604.01622v1

Abstract: Diffusion language models (DLMs) enable parallel, non-autoregressive text generation, yet existing DLM mixture-of-experts (MoE) models inherit token-choice (TC) routing from autoregressive systems, leading to load imbalance and rigid computation allocation. We show that expert-choice (EC) routing is a better fit for DLMs: it provides deterministic load balancing by design, yielding higher throughput and faster convergence than TC. Building on the property that EC capacity is externally controllable, we introduce timestep-dependent expert capacity, which varies expert allocation according to the denoising step. We find that allocating more capacity to low-mask-ratio steps consistently achieves the best performance under matched FLOPs, and provide a mechanistic explanation: tokens in low-mask-ratio contexts exhibit an order-of-magnitude higher learning efficiency, so concentrating compute on these steps yields the largest marginal return. Finally, we show that existing pretrained TC DLMs can be retrofitted to EC by replacing only the router, achieving faster convergence and improved accuracy across diverse downstream tasks. Together, these results establish EC routing as a superior paradigm for DLM MoE models and demonstrate that computation in DLMs can be treated as an adaptive policy rather than a fixed architectural constant. Code is available at https://github.com/zhangshuibai/EC-DLM.

Executive Summary

This study argues that expert-choice (EC) routing is a better fit for diffusion language models (DLMs) than the token-choice (TC) routing their mixture-of-experts (MoE) variants currently inherit from autoregressive systems. Because each expert selects a fixed number of tokens, EC routing balances load deterministically by design, yielding higher throughput and faster convergence than TC. Exploiting the fact that EC capacity is externally controllable, the authors introduce timestep-dependent expert capacity, which varies compute allocation across denoising steps; allocating more capacity to low-mask-ratio steps achieves the best performance under matched FLOPs. Finally, they show that existing pretrained TC DLMs can be retrofitted to EC by replacing only the router, improving convergence and downstream accuracy, which positions EC routing as an efficient and adaptable paradigm for DLM MoE models.
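The load-balancing contrast between the two routing schemes can be seen in a minimal numpy sketch. This is an illustration of the general TC-vs-EC selection rule, not the paper's implementation: under TC, each token picks its top experts and per-expert load is unconstrained; under EC, each expert picks a fixed number of tokens, so load is uniform by construction.

```python
import numpy as np

def token_choice_route(scores, k=1):
    """TC: each token independently picks its top-k experts.
    Nothing constrains how many tokens land on one expert,
    so per-expert load can be arbitrarily imbalanced."""
    return np.argsort(-scores, axis=1)[:, :k]  # (tokens, k)

def expert_choice_route(scores, capacity):
    """EC: each expert picks its top-`capacity` tokens.
    Every expert processes exactly `capacity` tokens, so load
    is balanced deterministically, with no balancing loss needed."""
    return np.argsort(-scores, axis=0)[:capacity, :]  # (capacity, experts)

rng = np.random.default_rng(0)
scores = rng.standard_normal((8, 4))  # routing scores: 8 tokens, 4 experts

tc = token_choice_route(scores)
tc_load = np.bincount(tc.ravel(), minlength=4)  # may be skewed

ec = expert_choice_route(scores, capacity=2)
ec_load = np.array([ec.shape[0]] * scores.shape[1])  # always uniform

print("TC per-expert load:", tc_load)
print("EC per-expert load:", ec_load)
```

Note that EC capacity is an external knob set at routing time, which is the property the paper's timestep-dependent allocation builds on.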

Key Points

  • Expert-choice routing balances expert load deterministically by design, improving throughput and convergence in DLMs relative to token-choice routing.
  • Timestep-dependent expert capacity, with more capacity allocated to low-mask-ratio denoising steps, achieves the best performance under matched FLOPs.
  • Existing pretrained TC DLMs can be retrofitted to EC by replacing only the router, improving convergence and accuracy on downstream tasks.
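The timestep-dependent capacity idea can be sketched as a schedule mapping the denoising step's mask ratio to a per-expert capacity factor. The linear interpolation below is a hypothetical illustrative choice, not the paper's schedule; the abstract motivates only the direction of the allocation (more compute at low-mask-ratio steps, where tokens exhibit an order-of-magnitude higher learning efficiency).

```python
def capacity_for_step(mask_ratio, c_min=1.0, c_max=4.0):
    """Hypothetical capacity schedule: low-mask-ratio denoising steps
    (few tokens still masked, late in generation) receive more expert
    capacity, since the paper reports these tokens learn far more
    efficiently. Linear interpolation is an illustrative assumption."""
    return c_min + (c_max - c_min) * (1.0 - mask_ratio)

# Late steps (low mask ratio) get the most compute per token.
for r in (0.9, 0.5, 0.1):
    print(f"mask_ratio={r:.1f} -> capacity factor {capacity_for_step(r):.2f}")
```

Because EC capacity is set outside the model, such a schedule can be applied without changing any learned weights, which is what makes compute an adaptive policy rather than an architectural constant.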

Merits

Strength in Adaptive Computation

Because EC capacity is controlled externally, computation can be reallocated across denoising steps rather than fixed at design time, and the paper shows this adaptivity improves performance under matched FLOPs.

Deterministic Load Balancing

Because each expert selects a fixed number of tokens, EC routing eliminates the load imbalance of token-choice routing by construction, improving throughput in DLMs.

Demerits

Limited Scope

The study focuses primarily on DLMs and may not be directly applicable to other types of models or applications.

Complexity of Implementation

Introducing timestep-dependent expert capacity may add complexity to the implementation of EC routing, potentially requiring significant modifications to existing models.

Expert Commentary

The study makes a significant contribution to the field of NLP by showing that expert-choice routing, previously developed for autoregressive MoE models, addresses the limitations of token-choice routing in DLMs. The introduction of timestep-dependent expert capacity is a particularly noteworthy innovation, as it enables adaptive computation allocation and leads to improved performance under matched FLOPs. While the study's focus on DLMs may limit its immediate impact, the findings have broader implications for how compute is allocated in future NLP architectures. Moreover, retrofitting existing pretrained TC DLMs to EC routing by replacing only the router offers a practical path for improving existing models, making this study a valuable resource for researchers and practitioners alike.
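The retrofit result can be illustrated with a toy sketch: the expert weights of a pretrained TC layer are kept untouched, and only how routing scores are consumed changes, from tokens selecting experts to experts selecting tokens. The layer structure and function names below are hypothetical; the paper's actual retrofit additionally replaces the router weights.

```python
import numpy as np

class MoELayer:
    """Toy MoE layer: a linear router scores tokens against experts.
    Hypothetical structure for illustration only."""
    def __init__(self, d, n_experts, rng):
        self.router = rng.standard_normal((d, n_experts))          # routing weights
        self.experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]

def retrofit_to_ec(layer):
    """Keep the expert weights; only change how scores are consumed.
    Under EC, each expert takes its top-`capacity` tokens, so the
    retrofitted layer is load-balanced with no retraining of experts."""
    def ec_forward(x, capacity):
        scores = x @ layer.router                        # (tokens, experts)
        chosen = np.argsort(-scores, axis=0)[:capacity]  # per-expert token ids
        return chosen
    return ec_forward

rng = np.random.default_rng(1)
layer = MoELayer(d=16, n_experts=4, rng=rng)
ec_forward = retrofit_to_ec(layer)
chosen = ec_forward(rng.standard_normal((10, 16)), capacity=3)
print(chosen.shape)  # 3 tokens per expert, balanced by design
```

The sketch highlights why the retrofit is cheap: the selection rule, not the expert computation, is what changes.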

Recommendations

  • Further research should be conducted to explore the applicability of EC routing to other types of models and applications.
  • The development of more efficient and scalable implementations of EC routing will be essential for widespread adoption in NLP applications.

Sources

Original: arXiv - cs.LG