GPU Kernel Optimization Beyond Full Builds: An LLM Framework with Minimal Executable Programs

In high-performance computing, hotspot GPU kernels are the primary bottlenecks, and expert manual tuning is costly and hard to port. Large language model (LLM) methods often assume that kernels can be compiled and executed cheaply, an assumption that fails in large applications where full builds and runs are expensive. We present an end-to-end LLM framework with performance feedback that optimizes kernels without building the full application. Starting from independently extracted hotspot kernels, it automatically completes the code into a Minimal Executable Program (MEP), then performs multi-round iterative optimization and evaluation outside the full application. The framework integrates Automatic Error Repair and Performance Pattern Inheritance to fix faults, preserve correctness, reuse effective tiling/memory/synchronization strategies, and reduce search cost. Optimized variants are reintegrated into the original application for validation. We evaluate on NVIDIA GPUs and the Haiguang Deep Computing Unit (DCU) platform (AMD-licensed architecture) using PolyBench, the AMD APP SDK, and hotspot kernels from large-scale supercomputing applications. The method achieves average speedups of 5.05x (PolyBench on NVIDIA), 7.77x (PolyBench on DCU), 1.77x (AMD APP SDK), and 1.25x on three hotspot kernels, surpassing direct LLM optimization. The approach requires no full-source dependencies, offers cross-platform portability, and enables practical, low-cost GPU kernel optimization.
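
The multi-round loop described in the abstract can be pictured with a minimal Python sketch. It is one reading of the abstract, not the authors' implementation: the names `optimize`, `propose`, `repair`, `run_mep`, `Candidate`, and `RunResult` are all hypothetical. `propose` and `repair` stand in for LLM calls, and `run_mep` stands in for building the Minimal Executable Program, checking its output against a reference, and timing it. Pattern inheritance is modeled simply as passing the tags of the best variant so far into the next proposal.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class RunResult:
    ok: bool                        # MEP built, ran, and passed its correctness check
    seconds: float = float("inf")   # measured kernel time when ok is True
    error: str = ""                 # compiler or runtime message when ok is False


@dataclass
class Candidate:
    source: str                                        # kernel source text
    patterns: List[str] = field(default_factory=list)  # e.g. "tiling" (hypothetical tags)


def optimize(
    baseline: Candidate,
    propose: Callable[[str, List[str], float], Candidate],  # assumed LLM call
    repair: Callable[[str, str], str],                       # assumed LLM call
    run_mep: Callable[[str], RunResult],                     # assumed build/run/time step
    rounds: int = 3,
    max_repairs: int = 2,
) -> Tuple[Candidate, float]:
    """Return the fastest correct variant found and its speedup over the baseline."""
    base = run_mep(baseline.source)
    if not base.ok:
        raise ValueError("baseline MEP must build and pass its correctness check")

    best, best_time = baseline, base.seconds
    for _round in range(rounds):
        # Performance feedback: the proposal sees the best time and inherited patterns.
        cand = propose(best.source, best.patterns, best_time)
        result = run_mep(cand.source)

        # Error repair: feed the failure message back, up to max_repairs times.
        for _attempt in range(max_repairs):
            if result.ok:
                break
            cand = Candidate(repair(cand.source, result.error), cand.patterns)
            result = run_mep(cand.source)

        # Keep a variant only if it is both correct and strictly faster.
        if result.ok and result.seconds < best_time:
            best, best_time = cand, result.seconds

    return best, base.seconds / best_time
```

Because every build and timing step goes through `run_mep`, the loop never touches the full application; under this sketch, only the returned variant would be reintegrated for final validation.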
