AscendOptimizer: Episodic Agent for Ascend NPU Operator Optimization

arXiv:2603.23566v1 Announce Type: new Abstract: AscendC (Ascend C) operator optimization on Huawei Ascend neural processing units (NPUs) faces a two-fold knowledge bottleneck: unlike the CUDA ecosystem, there are few public reference implementations to learn from, and performance hinges on a coupled two-part artifact - a host-side tiling program that orchestrates data movement and a kernel program that schedules and pipelines instructions. We present AscendOptimizer, an episodic agent that bootstraps this missing expertise by turning execution into experience. On the host side, AscendOptimizer performs profiling-in-the-loop evolutionary search to discover valid and high-performing tiling and data-movement configurations directly from hardware feedback. On the kernel side, it mines transferable optimization motifs by rewinding optimized kernels - systematically de-optimizing them to synthesize instructive "bad-to-good" trajectories - and distills these motifs into a retrievable experience bank for guided rewriting. By alternating host tuning and kernel rewriting in a closed loop, AscendOptimizer steadily expands feasibility and pushes latency down. On a benchmark of 127 real AscendC operators, AscendOptimizer achieves a 1.19x geometric-mean speedup over the open-source baseline, with 49.61% of operators outperforming their references, outperforming strong agent and search baselines.

Executive Summary

The paper presents AscendOptimizer, an episodic agent for optimizing operators on Huawei Ascend NPUs. Guided by hardware profiling feedback, the agent searches for valid, high-performing tiling and data-movement configurations on the host side and mines transferable optimization motifs from previously optimized kernels. On a benchmark of 127 real AscendC operators, AscendOptimizer achieves a 1.19x geometric-mean speedup over the open-source baseline and outperforms strong agent and search baselines. The work tackles the knowledge bottleneck in AscendC operator optimization, where public reference implementations are scarce, and demonstrates the potential of episodic, execution-grounded learning for low-level performance tuning.

Key Points

  • AscendOptimizer addresses the knowledge bottleneck in AscendC operator optimization using episodic learning.
  • The agent discovers valid and high-performing tiling and data-movement configurations through profiling-in-the-loop evolutionary search.
  • AscendOptimizer mines transferable optimization motifs from optimized kernels and distills them into a retrievable experience bank.
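The host-side search in the first two points can be pictured as a small evolutionary loop: propose tiling configurations, profile them on hardware, discard infeasible ones, and breed the survivors. The sketch below is ours, not the paper's API; the parameter space, the feasibility rule, and the simulated `profile` function are illustrative stand-ins for real AscendC tiling parameters and on-NPU measurement.

```python
import random

# Hypothetical tiling search space; real AscendC tiling parameters differ.
SPACE = {
    "block_dim": [8, 16, 32, 48],
    "tile_len": [256, 512, 1024, 2048],
    "buffer_num": [1, 2, 4],  # single/double/quad buffering
}

def profile(cfg):
    """Stand-in for on-hardware profiling: returns a latency (lower is
    better) or None if the configuration is infeasible. Simulated here."""
    if cfg["tile_len"] * cfg["buffer_num"] > 4096:  # pretend buffer overflow
        return None
    return 1000.0 / (cfg["block_dim"] * cfg["buffer_num"]) + cfg["tile_len"] * 0.01

def mutate(cfg):
    """Randomly perturb one tiling parameter of a parent configuration."""
    child = dict(cfg)
    key = random.choice(list(SPACE))
    child[key] = random.choice(SPACE[key])
    return child

def evolve(generations=30, pop_size=8, seed=0):
    """Profiling-in-the-loop search: keep the fastest valid configs,
    refill the population with mutants of the survivors."""
    random.seed(seed)
    pop = [{k: random.choice(v) for k, v in SPACE.items()} for _ in range(pop_size)]
    best, best_lat = None, float("inf")
    for _ in range(generations):
        scored = [(profile(c), c) for c in pop]
        valid = [(lat, c) for lat, c in scored if lat is not None]
        if not valid:  # restart if every candidate was infeasible
            pop = [{k: random.choice(v) for k, v in SPACE.items()}
                   for _ in range(pop_size)]
            continue
        valid.sort(key=lambda t: t[0])
        if valid[0][0] < best_lat:
            best_lat, best = valid[0]
        elite = [c for _, c in valid[: max(1, pop_size // 2)]]
        pop = elite + [mutate(random.choice(elite))
                       for _ in range(pop_size - len(elite))]
    return best, best_lat
```

Because feasibility is only checked by running (or simulating) the candidate, the search "expands feasibility" in the abstract's sense: infeasible configurations are pruned by hardware feedback rather than by a hand-written model.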

Merits

Adaptability and Efficiency

The closed-loop procedure, which alternates host-side tiling search with kernel rewriting, lets the agent adapt to each operator and converge efficiently on effective tiling configurations and optimization motifs.
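On the kernel side, the paper distills motifs mined from "bad-to-good" trajectories into a retrievable experience bank that guides rewriting. A minimal sketch of such a bank is shown below; the motif texts, the tag sets, and the overlap-based retrieval are our illustrative assumptions, not the paper's actual retrieval mechanism.

```python
# Toy experience bank: each entry pairs a set of descriptive tags with a
# distilled optimization motif. Contents are invented for illustration.
EXPERIENCE_BANK = [
    {"tags": {"copy", "pipeline"},
     "motif": "overlap data copy with compute via double buffering"},
    {"tags": {"loop", "vector"},
     "motif": "fuse adjacent vector ops to cut instruction-issue overhead"},
    {"tags": {"copy", "align"},
     "motif": "pad copies to the alignment granularity to avoid scalar fallback"},
]

def retrieve(kernel_tags, k=2):
    """Return the k motifs whose tags overlap the kernel's tags the most,
    to be injected into the rewriting prompt for the current kernel."""
    ranked = sorted(EXPERIENCE_BANK,
                    key=lambda e: len(e["tags"] & kernel_tags),
                    reverse=True)
    return [e["motif"] for e in ranked[:k]]
```

In the closed loop, each rewriting episode would first call `retrieve` with a description of the kernel at hand, apply the suggested motifs, then hand the rewritten kernel back to the host-side tiling search for re-profiling.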

Improved Performance

On a benchmark of 127 real AscendC operators, the agent achieves a 1.19x geometric-mean speedup over the open-source baseline, with 49.61% of operators outperforming their reference implementations, and it beats strong agent and search baselines.

Addressing Knowledge Bottleneck

By turning hardware profiling feedback into reusable experience, AscendOptimizer sidesteps the scarcity of public AscendC reference implementations that limits learning-based approaches in this ecosystem.

Demerits

Limited Generalizability

The work targets AscendC operators on Huawei Ascend NPUs, and it is unclear how well the approach transfers to other accelerators or programming models.

Dependence on Hardware Feedback

The approach requires reliable in-the-loop profiling on target hardware; noisy measurements or limited hardware access would degrade both the evolutionary search and the motif mining.

Expert Commentary

AscendOptimizer shows how episodic, execution-grounded learning can compensate for a thin public knowledge base: rather than learning from existing code, the agent manufactures its own training signal by profiling candidate configurations and by de-optimizing working kernels to synthesize instructive trajectories. The main open questions are generalizability, since the method is evaluated only on Ascend NPUs, and its reliance on tight hardware-in-the-loop feedback. Evaluating the agent on more diverse architectures and operator classes will be key to establishing its wider applicability.

Recommendations

  • Future research should focus on extending AscendOptimizer's episodic learning approach to other areas, such as software optimization and robotics.
  • The development of AscendOptimizer highlights the need for a standardized hardware feedback framework to ensure widespread adoption and reliability.

Sources

Original: arXiv - cs.LG