AIMER: Calibration-Free Task-Agnostic MoE Pruning
arXiv:2603.18492v1 Announce Type: new Abstract: Mixture-of-Experts (MoE) language models increase parameter capacity without a proportional increase in per-token compute, but deployment still requires storing all experts, making expert pruning important for reducing memory and serving overhead. Existing task-agnostic expert pruning methods are typically calibration-dependent: they estimate expert importance from routing or activation statistics on a calibration set, which makes pruning outcomes sensitive to the choice of calibration set and adds substantial preprocessing cost. We introduce AIMER (Absolute mean over root mean square IMportance for Expert Ranking), a simple calibration-free criterion that yields clear within-layer score separation and distinct expert stratification. Across 7B to 30B MoE language models at 25% and 50% pruning ratios over 16 benchmarks, AIMER consistently delivers competitive or stronger overall performance against state-of-the-art calibration-based expert pruning baselines, with only 0.22–1.27 seconds needed to score the experts.
Executive Summary
This article presents AIMER, a calibration-free, task-agnostic expert pruning method for Mixture-of-Experts (MoE) language models. AIMER achieves competitive or stronger performance than state-of-the-art calibration-based expert pruning baselines while significantly reducing preprocessing cost and scoring time. The proposed criterion yields clear within-layer score separation and distinct expert stratification, making it a promising solution for deploying large-scale MoE language models. The results demonstrate AIMER's effectiveness across 16 benchmarks, on models from 7B to 30B parameters, at both 25% and 50% pruning ratios. The article highlights the importance of expert pruning for reducing memory and serving overhead in MoE language models.
Key Points
- AIMER is a calibration-free, task-agnostic expert pruning method for MoE language models.
- AIMER achieves competitive or stronger performance than calibration-based expert pruning baselines.
- The method significantly reduces preprocessing cost and scoring time.
- AIMER yields clear within-layer score separation and distinct expert stratification.
Merits
Simple and Efficient
AIMER introduces a simple, efficient, calibration-free criterion that eliminates calibration-set preprocessing entirely and scores a model's experts in 0.22–1.27 seconds.
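The abstract names the criterion ("absolute mean over root mean square") but does not give its exact formula. The sketch below is one plausible reading, not the paper's specification: each expert's weights are scored by mean absolute value divided by root mean square, and the highest-scoring experts within each layer are kept (the direction of importance is also an assumption).

```python
import math

def aimer_score(weights):
    """Score a flat list of expert weights by |mean| over RMS.

    Hypothetical reading of 'absolute mean over root mean square';
    the paper's exact definition may differ.
    """
    n = len(weights)
    abs_mean = sum(abs(w) for w in weights) / n
    rms = math.sqrt(sum(w * w for w in weights) / n)
    return abs_mean / rms

def prune_layer(expert_weights, ratio=0.25):
    """Rank a layer's experts and keep the top (1 - ratio) fraction.

    Assumes higher score means more important (our assumption).
    Returns the sorted indices of the experts to keep.
    """
    scores = [aimer_score(w) for w in expert_weights]
    n_keep = max(1, round(len(scores) * (1 - ratio)))
    order = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
    return sorted(order[:n_keep])
```

Because the score depends only on the stored weights, it needs no forward passes and no calibration data, which is consistent with the sub-second scoring times the abstract reports.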
Demerits
Limited Evaluation
Although the evaluation spans 16 benchmarks, it covers only two pruning ratios (25% and 50%) and models from 7B to 30B parameters, which may not fully capture AIMER's behavior at more aggressive pruning levels or on larger models.
Expert Commentary
The article presents a significant contribution to the field of expert pruning in MoE language models. AIMER's ability to achieve competitive or stronger performance without any calibration data is a notable achievement. However, the evaluation's restriction to two pruning ratios and a limited model-size range raises questions about generalizability. Further research is needed to fully assess AIMER's potential and its applicability to a broader range of use cases. Nevertheless, the article provides a valuable starting point for exploring calibration-free expert pruning in MoE language models.
Recommendations
- Future research should evaluate AIMER on a broader range of pruning ratios, model scales, and benchmarks to assess its performance and generalizability.
- Development of calibration-free pruning criteria in the spirit of AIMER should continue, to further explore what expert importance can be read directly from model weights.