Leveraging attention sparsity to accelerate long-context large language models (LLMs) has been a hot research topic. However, current algorithms such as sparse attention or key-value (KV) cache compression tend to use a fixed budget, which poses a significant challenge in deployment because it fails to account for the dynamic nature of real-world scenarios, where the optimal balance between accuracy and efficiency can vary greatly. In this paper, we find that adapting top-$p$ sampling (nucleus sampling) to sparse attention can surprisingly achieve adaptive budgeting. Based on this finding, we propose Twilight, a framework that brings adaptive sparsity to any existing sparse attention algorithm without sacrificing its accuracy. Empirical results show that Twilight can adaptively prune up to 98% of redundant tokens, leading to a $15.4\times$ speedup in self-attention operations and a $3.9\times$ speedup in end-to-end per-token latency for long-context LLM decoding.
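To illustrate the core idea, below is a minimal sketch (not the paper's implementation) of top-$p$ ("nucleus") selection applied to a single query's attention distribution: the budget becomes the smallest set of keys whose attention mass reaches a threshold $p$, so the number of retained tokens adapts to how peaked the distribution is instead of being a fixed top-$k$. The function name and shapes here are illustrative assumptions.

```python
import torch

def top_p_attention_mask(scores: torch.Tensor, p: float = 0.98) -> torch.Tensor:
    """Return a boolean mask keeping the smallest set of keys whose
    attention probabilities sum to at least p.

    scores: pre-softmax attention logits for one query, shape [num_keys].
    """
    probs = torch.softmax(scores, dim=-1)
    # Sort probabilities in descending order and accumulate their mass.
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Number of keys needed for the cumulative mass to reach p
    # (at least one key is always kept).
    keep = int(torch.searchsorted(cumulative, torch.tensor(p)).item()) + 1
    mask = torch.zeros_like(probs, dtype=torch.bool)
    mask[sorted_idx[:keep]] = True
    return mask

# Example: a peaked distribution keeps few keys, a flat one keeps many,
# so the effective budget adapts per query.
scores = torch.randn(4096)
mask = top_p_attention_mask(scores, p=0.95)
print(mask.sum().item(), "of", mask.numel(), "keys kept")
```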