StitchCUDA: An Automated Multi-Agents End-to-End GPU Programing Framework with Rubric-based Agentic Reinforcement Learning

arXiv:2603.02637v1 Announce Type: cross Abstract: Modern machine learning (ML) workloads increasingly rely on GPUs, yet achieving high end-to-end performance remains challenging because it depends on both GPU kernel efficiency and host-side settings. Although LLM-based methods show promise for automated GPU kernel generation, prior work mainly focuses on single-kernel optimization and does not extend to end-to-end programs, hindering practical deployment. To address this challenge, we propose StitchCUDA, a multi-agent framework for end-to-end GPU program generation, with three specialized agents: a Planner to orchestrate the whole system design, a Coder dedicated to implementing it step by step, and a Verifier for correctness checking and performance profiling using Nsys/NCU. To fundamentally improve the Coder's ability in end-to-end GPU programming, StitchCUDA integrates rubric-based agentic reinforcement learning over two atomic skills, task-to-code generation and feedback-driven code optimization, with a combined rubric reward and rule-based reward from real executions. As a result, the Coder learns how to implement advanced CUDA programming techniques (e.g., custom kernel fusion, cuBLAS epilogue), and we also effectively prevent the Coder's reward hacking (e.g., simply copying PyTorch code or hardcoding outputs) during benchmarking. Experiments on KernelBench show that StitchCUDA achieves a nearly 100% success rate on end-to-end GPU programming tasks, with 1.72x better speedup over the multi-agent baseline and 2.73x better speedup over the RL model baselines.
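
The division of labor among the three agents can be pictured as a plan, implement, verify loop. The following is a minimal Python sketch of that loop, not the authors' implementation: the names `Report`, `run_pipeline`, and the `toy_*` stand-ins are hypothetical, the agents are plain functions rather than LLM calls, and the stopping rule (stop when the Verifier has no further feedback) is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Report:
    """Hypothetical Verifier output: correctness, runtime, and feedback."""
    correct: bool
    runtime_ms: float
    feedback: str  # empty string means the Verifier has nothing left to suggest


def run_pipeline(
    task: str,
    planner: Callable[[str], List[str]],
    coder: Callable[[List[str], str], str],
    verifier: Callable[[str], Report],
    max_rounds: int = 3,
) -> Optional[Tuple[str, Report]]:
    """Plan once, then alternate Coder and Verifier; return the fastest correct code."""
    plan = planner(task)
    feedback = ""
    best: Optional[Tuple[str, Report]] = None
    for _ in range(max_rounds):
        code = coder(plan, feedback)   # task-to-code or feedback-driven rewrite
        report = verifier(code)        # correctness check plus profiling
        if report.correct and (best is None or report.runtime_ms < best[1].runtime_ms):
            best = (code, report)
        if not report.feedback:        # assumed stopping rule
            break
        feedback = report.feedback
    return best


# Toy stand-ins for the agents (assumptions, for illustration only).
def toy_planner(task: str) -> List[str]:
    return [f"profile {task}", "fuse elementwise ops into the GEMM"]


def toy_coder(plan: List[str], feedback: str) -> str:
    return "fused_kernel" if feedback else "naive_kernel"


def toy_verifier(code: str) -> Report:
    if code == "naive_kernel":
        return Report(True, 4.0, "memory-bound: fuse bias and activation")
    return Report(True, 2.5, "")
```

With these stand-ins, `run_pipeline("mlp_block", toy_planner, toy_coder, toy_verifier)` runs two rounds and keeps the second, faster candidate. In a real system each callable would wrap an LLM call or a compile-and-profile step.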

Executive Summary

The article proposes StitchCUDA, a multi-agent framework for automated end-to-end GPU programming built around three agents: a Planner, a Coder, and a Verifier. The Coder is trained with rubric-based agentic reinforcement learning on two skills, task-to-code generation and feedback-driven code optimization. On KernelBench, the authors report a nearly 100% success rate on end-to-end GPU programming tasks, with 1.72x better speedup than the multi-agent baseline and 2.73x better speedup than the RL model baselines. The work is relevant to anyone optimizing machine learning workloads on GPUs.

Key Points

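  • StitchCUDA is a multi-agent framework for end-to-end GPU program generation, with a Planner, a Coder, and a Verifier
  • The Coder is trained with rubric-based agentic reinforcement learning that combines a rubric reward with a rule-based reward from real executions
  • On KernelBench, the framework achieves a nearly 100% success rate on end-to-end GPU programming tasks, with 1.72x better speedup than the multi-agent baseline and 2.73x better speedup than the RL model baselines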

Merits

Improved Efficiency

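Rubric-based agentic reinforcement learning teaches the Coder advanced CUDA programming techniques, such as custom kernel fusion and cuBLAS epilogues. The reported result on KernelBench is 1.72x better speedup than the multi-agent baseline and 2.73x better speedup than the RL model baselines.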
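
The abstract describes the training signal as a rubric reward combined with a rule-based reward from real executions, with safeguards against reward hacking. The sketch below shows one way such a combination could look. It is an illustration under stated assumptions, not the paper's reward: the 50/50 weighting, the speedup cap `MAX_SPEEDUP`, the rubric item names, and the string-matching hack check in `looks_like_hack` are all hypothetical.

```python
from typing import Dict

MAX_SPEEDUP = 4.0  # hypothetical cap so one outlier cannot dominate the reward


def rule_based_reward(compiled: bool, correct: bool, speedup: float) -> float:
    """Reward from a real execution: zero unless the code builds and is correct."""
    if not (compiled and correct):
        return 0.0
    return min(speedup, MAX_SPEEDUP) / MAX_SPEEDUP


def looks_like_hack(code: str) -> bool:
    """Crude illustrative check: flag code with no custom kernel and no cuBLAS call."""
    has_gpu_work = "__global__" in code or "cublas" in code.lower()
    return not has_gpu_work


def combined_reward(
    code: str,
    compiled: bool,
    correct: bool,
    speedup: float,
    rubric_scores: Dict[str, float],
    rubric_weight: float = 0.5,  # hypothetical weighting
) -> float:
    """Blend the mean rubric score (each item in [0, 1]) with the rule-based reward."""
    if looks_like_hack(code):
        return 0.0  # e.g. the submission only forwards to PyTorch
    rubric = sum(rubric_scores.values()) / len(rubric_scores) if rubric_scores else 0.0
    rule = rule_based_reward(compiled, correct, speedup)
    return (1.0 - rubric_weight) * rule + rubric_weight * rubric
```

For example, a correct kernel with a 2.0x speedup and rubric scores `{"fusion": 1.0, "memory": 0.5}` gets `0.5 * 0.5 + 0.5 * 0.75`, while a submission that only calls `torch.matmul` gets zero regardless of its measured speed.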

High Success Rate

On KernelBench, the framework achieves a nearly 100% success rate on end-to-end GPU programming tasks.
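
For readers unfamiliar with how such figures are typically derived, the sketch below computes a success rate and an average speedup from per-task results. The definitions are assumptions made for illustration: "success" is taken to mean the generated program produced correct output, speedup is the reference runtime divided by the generated program's runtime, and the average is an arithmetic mean over correct tasks. The summary does not state the paper's exact aggregation, and `TaskResult` is a hypothetical record type.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TaskResult:
    """Hypothetical per-task record from a benchmark run."""
    correct: bool
    baseline_ms: float   # runtime of the reference implementation
    generated_ms: float  # runtime of the generated GPU program


def success_rate(results: List[TaskResult]) -> float:
    """Fraction of tasks whose generated program is correct."""
    return sum(r.correct for r in results) / len(results)


def mean_speedup(results: List[TaskResult]) -> float:
    """Arithmetic mean of baseline/generated runtime over correct tasks only."""
    speedups = [r.baseline_ms / r.generated_ms for r in results if r.correct]
    return sum(speedups) / len(speedups) if speedups else 0.0
```

Under these definitions, a statement such as "1.72x better speedup over a baseline" would correspond to the ratio of two systems' aggregate speedups on the same task set.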

Demerits

Complexity

Coordinating three agents and training the Coder with reinforcement learning may add complexity to development and deployment compared with a single-model approach.

Limited Generalizability

The reported results come from KernelBench. Performance on other workloads or domains is not shown in the abstract and would require further testing and validation.

Expert Commentary

StitchCUDA extends LLM-based GPU code generation from single-kernel optimization to end-to-end programs, where performance depends on both kernel efficiency and host-side settings. Its distinguishing feature is rubric-based agentic reinforcement learning for the Coder, which combines a rubric reward with a rule-based reward from real executions and is designed to discourage reward hacking such as copying PyTorch code or hardcoding outputs. The reported KernelBench results are strong, but further research is needed to confirm that they generalize beyond the benchmark.

Recommendations

  • Further testing and validation to ensure generalizability and robustness
  • Investigation of potential applications and use cases for the StitchCUDA framework
