Flashlight: PyTorch Compiler Extensions to Accelerate Attention Variants

Attention is a fundamental building block of large language models (LLMs), so there have been many efforts to implement it efficiently. For example, FlashAttention leverages tiling and kernel fusion to optimize attention. Recently, a number of attention variants have been introduced to improve model quality or efficiency. Supporting them efficiently remains difficult because they usually require specialized kernels or hand-tuned implementations. FlexAttention recently addressed part of this gap by using static programming templates to support FlashAttention-like kernels for a subset of attention variants. In this paper, we introduce Flashlight, a compiler-native framework within the PyTorch ecosystem that automatically generates fused, FlashAttention-style kernels for arbitrary attention-based programs, without relying on static templates or predefined kernel specializations. Flashlight leverages PyTorch's compilation workflow to fuse and tile attention computations transparently, enabling efficient execution for diverse attention patterns. It supports all variants expressible in the FlexAttention model, and it also handles more general, data-dependent attention formulations that are beyond the capabilities of FlexAttention. Our results show that Flashlight produces kernels with performance competitive with or superior to FlexAttention, while offering the flexibility of native PyTorch code, so developers can rapidly explore new attention models without sacrificing performance.
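To make the tiling idea concrete, the following is a minimal plain-Python sketch of the online-softmax recurrence that FlashAttention-style kernels are built on: keys and values are processed tile by tile while a running maximum, a running normalizer, and an unnormalized output accumulator are maintained, so the full score vector never has to be materialized at once. This is an illustration under stated assumptions, not the kernel that Flashlight generates. The function names, the `tile` parameter, and the `score_mod(score, q_idx, k_idx)` hook (a stand-in for an attention variant such as an ALiBi-style distance bias) are hypothetical and do not come from the paper or from any PyTorch API.

```python
import math


def naive_attention_row(q, keys, values, score_mod=None, q_idx=0):
    """Reference: softmax(q . K^T / sqrt(d)) @ V for one query, all keys at once."""
    scale = 1.0 / math.sqrt(len(q))
    scores = []
    for k_idx, k in enumerate(keys):
        s = scale * sum(a * b for a, b in zip(q, k))
        if score_mod is not None:
            s = score_mod(s, q_idx, k_idx)  # hypothetical variant hook
        scores.append(s)
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def tiled_attention_row(q, keys, values, tile=2, score_mod=None, q_idx=0):
    """Same result, computed one key/value tile at a time (online softmax).

    Assumes every score is finite; a fully masked tile (all scores -inf)
    would need an extra guard that this sketch omits.
    """
    scale = 1.0 / math.sqrt(len(q))
    dim = len(values[0])
    running_max = -math.inf   # max score seen so far
    running_sum = 0.0         # softmax normalizer relative to running_max
    acc = [0.0] * dim         # unnormalized output relative to running_max

    for start in range(0, len(keys), tile):
        stop = min(start + tile, len(keys))
        tile_scores = []
        for k_idx in range(start, stop):
            s = scale * sum(a * b for a, b in zip(q, keys[k_idx]))
            if score_mod is not None:
                s = score_mod(s, q_idx, k_idx)
            tile_scores.append(s)

        new_max = max(running_max, max(tile_scores))
        # Rescale earlier partial results to the new maximum.
        # On the first tile this is exp(-inf) == 0.0, which clears the zeros.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]

        for offset, s in enumerate(tile_scores):
            w = math.exp(s - new_max)
            running_sum += w
            v = values[start + offset]
            for d in range(dim):
                acc[d] += w * v[d]
        running_max = new_max

    return [a / running_sum for a in acc]


def alibi_like_bias(score, q_idx, k_idx, slope=0.5):
    """Illustrative score modification: penalize distance between positions."""
    return score - slope * abs(q_idx - k_idx)
```

Calling `tiled_attention_row(q, keys, values, tile=2, score_mod=alibi_like_bias, q_idx=3)` should agree with `naive_attention_row` on the same arguments up to floating-point rounding, for any tile size. In a real kernel, the score computation, the score modification, the softmax, and the weighted sum over values are fused into a single pass over tiles held in fast on-chip memory; the sketch shows only the arithmetic that makes such a fusion possible.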
