GPU kernel optimization has long been a central challenge at the intersection of high-performance computing and machine learning. Efficient kernels are crucial for accelerating large language model (LLM) training and serving, yet attaining high performance typically requires extensive manual tuning. Compiler-based systems reduce some of this burden, but still demand substantial manual design and engineering effort. Recently, researchers have explored using LLMs for GPU kernel generation, though prior work has largely focused on translating high-level PyTorch modules into CUDA code. In this work, we introduce Astra, the first LLM-based multi-agent system for GPU kernel optimization. Unlike previous approaches, Astra starts from existing CUDA implementations extracted from SGLang, a widely deployed framework for serving LLMs, rather than treating PyTorch modules as the specification. Within Astra, specialized LLM agents collaborate through iterative code generation, testing, profiling, and planning to produce kernels that are both correct and high-performance. On kernels from SGLang, Astra achieves an average speedup of 1.32x using zero-shot prompting with OpenAI o4-mini. A detailed case study further demonstrates that LLMs can autonomously apply loop transformations, optimize memory access patterns, exploit CUDA intrinsics, and leverage fast math operations to yield substantial performance gains. Our work highlights multi-agent LLM systems as a promising new paradigm for GPU kernel optimization. Our code is publicly available at https://github.com/Anjiang-Wei/Astra.