Autocomp: A Powerful and Portable Code Optimizer for Tensor Accelerators

Hardware accelerators, especially those designed for tensor processing, have become ubiquitous in today's computing landscape. However, even with significant efforts in building compilers, programming these tensor accelerators remains challenging, leaving much of their potential underutilized. Recently, large language models (LLMs), trained on large amounts of code, have shown significant promise in code generation and optimization tasks, but generating code in low-resource languages, such as specialized tensor accelerator code, still poses a significant challenge. We tackle this challenge with Autocomp, an approach that empowers accelerator programmers to leverage domain knowledge and hardware feedback to optimize code via an automated LLM-driven search. We accomplish this by: 1) formulating each optimization pass as a structured two-phase prompt, divided into planning and code generation phases, 2) inserting domain knowledge during planning via a concise and adaptable optimization menu, and 3) integrating correctness and performance metrics from hardware as feedback at each search iteration. Across three distinct hardware platforms, we demonstrate that Autocomp-optimized code runs 5.6x faster than the vendor-provided library (Gemmini), outperforms expert-level hand-tuned code by 1.9x (AWS Trainium), and achieves 3.8x higher performance than a machine learning-based cost model for GPUs (NVIDIA L40S). Additionally, we demonstrate that optimization schedules generated by Autocomp can be reused across similar tensor operations, improving speedups by up to 24% under a fixed sample budget.
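
The three ingredients listed above (a two-phase prompt, an optimization menu, and hardware feedback at each search iteration) can be illustrated with a minimal Python sketch. This is not Autocomp's implementation: the abstract does not specify the search strategy, so the beam-style search here is an assumption, and every name (`OPTIMIZATION_MENU`, `plan_prompt`, `optimize`, and the `toy_*` stand-ins for the LLM and the hardware) is hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical menu entries, for illustration only.
OPTIMIZATION_MENU: List[str] = [
    "tile loops",
    "hoist redundant loads",
    "overlap data movement with compute",
]


@dataclass
class Candidate:
    code: str
    latency: float  # measured cost; lower is better


def plan_prompt(code: str, latency: float, menu: List[str]) -> str:
    """Phase 1 prompt: the code, its measured cost, and the optimization menu."""
    options = "\n".join(f"- {item}" for item in menu)
    return (
        f"Current code:\n{code}\n"
        f"Measured latency: {latency}\n"
        f"Pick one optimization and describe a plan:\n{options}"
    )


def optimize(
    initial_code: str,
    propose_plan: Callable[[str, int], str],     # stands in for the planning LLM call
    implement_plan: Callable[[str, str], str],   # stands in for the code-generation LLM call
    evaluate: Callable[[str], Optional[float]],  # latency, or None if the code is incorrect
    iterations: int = 3,
    plans_per_candidate: int = 3,
    beam_width: int = 2,
) -> Candidate:
    start = evaluate(initial_code)
    if start is None:
        raise ValueError("initial code must be functionally correct")
    beam = [Candidate(initial_code, start)]
    for _ in range(iterations):
        pool = list(beam)  # keep parents so the best result never regresses
        for parent in beam:
            prompt = plan_prompt(parent.code, parent.latency, OPTIMIZATION_MENU)
            for sample in range(plans_per_candidate):
                plan = propose_plan(prompt, sample)       # phase 1: planning
                code = implement_plan(parent.code, plan)  # phase 2: code generation
                latency = evaluate(code)                  # correctness + performance feedback
                if latency is not None:                   # discard incorrect candidates
                    pool.append(Candidate(code, latency))
        beam = sorted(pool, key=lambda c: c.latency)[:beam_width]
    return beam[0]


# Toy stand-ins so the sketch runs without an LLM or an accelerator.
def toy_propose_plan(prompt: str, sample: int) -> str:
    return OPTIMIZATION_MENU[sample % len(OPTIMIZATION_MENU)]


def toy_implement_plan(code: str, plan: str) -> str:
    return code + f"\n# applied: {plan}"


def toy_evaluate(code: str) -> Optional[float]:
    applied = [ln for ln in code.splitlines() if ln.startswith("# applied: ")]
    if len(applied) != len(set(applied)):
        return None  # pretend that repeating an optimization breaks correctness
    return 100.0 / (1 + len(applied))
```

Calling `optimize("kernel()", toy_propose_plan, toy_implement_plan, toy_evaluate)` runs the loop end to end with the toy callbacks. In a real setting, the two LLM callbacks and the `evaluate` function would be replaced by model calls and by compilation plus measurement on the target accelerator.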
