Flashlight: PyTorch Compiler Extensions to Accelerate Attention Variants

Attention is a fundamental building block of large language models (LLMs), so there have been many efforts to implement it efficiently. For example, FlashAttention leverages tiling and kernel fusion to optimize attention. Recently, a number of variants of attention have been introduced to enhance model quality or efficiency. Supporting them efficiently remains difficult since they usually require specialized kernels or hand-tuned implementations. FlexAttention recently addressed part of this gap by using static programming templates to support FlashAttention-like kernels for a subset of attention variants. In this paper, we introduce Flashlight, a compiler-native framework within the PyTorch ecosystem that automatically generates fused, FlashAttention-style kernels for arbitrary attention-based programs, without relying on static templates or predefined kernel specializations. Flashlight leverages PyTorch's compilation workflow to fuse and tile attention computations transparently, enabling efficient execution for diverse attention patterns. Not only does it support all variants expressible in the FlexAttention model but it also handles more general, data-dependent attention formulations that are beyond the capabilities of FlexAttention. Our results show that Flashlight produces kernels with competitive or superior performance to FlexAttention, while offering the flexibility of native PyTorch code, enabling developers to rapidly explore new attention models without sacrificing performance.
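To make the tiling idea behind FlashAttention-style kernels concrete, the following is a minimal pure-Python sketch. It is not Flashlight's generated code or FlexAttention's API. It compares a naive attention that materializes every score row with a tiled version that streams over key/value blocks using a running max and running sum (the online-softmax trick). The names `naive_attention`, `tiled_attention`, `score_mod`, and `causal_alibi` are hypothetical and chosen only for illustration. The `score_mod` hook stands in for an attention variant, here a causal mask with a distance-based bias.

```python
import math


def _score(qi, kj, i, j, scale, score_mod):
    """Scaled dot product for one (query, key) pair, optionally modified."""
    s = scale * sum(a * b for a, b in zip(qi, kj))
    return score_mod(s, i, j) if score_mod is not None else s


def naive_attention(q, k, v, score_mod=None):
    """Reference: build the full score row, apply softmax, then weight V."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for i, qi in enumerate(q):
        scores = [_score(qi, kj, i, j, scale, score_mod) for j, kj in enumerate(k)]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w * vj[d] for w, vj in zip(weights, v)) / total
                    for d in range(len(v[0]))])
    return out


def tiled_attention(q, k, v, score_mod=None, block=2):
    """Stream over K/V in blocks; keep only a running max, sum, and output."""
    scale = 1.0 / math.sqrt(len(q[0]))
    dv = len(v[0])
    out = []
    for i, qi in enumerate(q):
        m = -math.inf      # running max of the scores seen so far
        denom = 0.0        # running softmax denominator
        acc = [0.0] * dv   # running unnormalized output
        for start in range(0, len(k), block):
            stop = min(start + block, len(k))
            scores = [_score(qi, k[j], i, j, scale, score_mod)
                      for j in range(start, stop)]
            m_new = max(m, max(scores))
            if m_new == -math.inf:
                continue  # everything so far is masked; nothing to accumulate
            # Rescale earlier partial results to the new running max.
            correction = math.exp(m - m_new)
            denom *= correction
            acc = [a * correction for a in acc]
            for offset, s in enumerate(scores):
                w = math.exp(s - m_new)
                denom += w
                vj = v[start + offset]
                for d in range(dv):
                    acc[d] += w * vj[d]
            m = m_new
        out.append([a / denom for a in acc])
    return out


def causal_alibi(s, i, j):
    """Hypothetical variant: causal mask plus a linear distance penalty."""
    return s - 0.5 * (i - j) if j <= i else -math.inf
```

Both functions should agree up to floating-point rounding for any block size. The tiled version never holds a full score row, and that property is what lets a fused kernel keep its intermediates in fast on-chip memory.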
