Vision-Language Large Models (VLLMs) face significant efficiency challenges when processing high-resolution inputs. The quadratic complexity of attention, autoregressive generation, and the constantly growing key-value (KV) cache severely hinder the prefilling and decoding stages. Recent efforts compress the KV cache by identifying and pruning the cache entries of less important tokens, but these methods typically rely on attention scores to estimate token importance, making them incompatible with efficient attention mechanisms such as FlashAttention and sparse attention, which do not explicitly materialize attention matrices. Moreover, existing methods overlook the fact that sparse attention, while accelerating the prefilling stage, alters the information structure of the KV cache and thereby compromises the effectiveness of downstream KV cache compression strategies. To address these issues, we propose PureKV, a plug-and-play framework for the joint optimization of sparse attention and KV cache compression. We first introduce a KV cache compression strategy that is fully compatible with efficient attention accelerators: it uses lower-layer attention scores to estimate the importance of higher layers' KV cache, enabling active pruning without compromising accuracy. In addition, we design a Spatial-Temporal Sparse Attention (ST-SpAttn) module tailored for video KV cache compression. The module combines spatial and temporal attention sparsity, purifying spatial noise and temporal redundancy in the KV cache to improve the effectiveness of KV cache compression while also accelerating the prefilling stage of VLLMs. Extensive experiments on VLLMs (VideoLLaMA2, Qwen2.5-VL) show that PureKV achieves 5.0× KV cache compression and 3.16× prefilling acceleration with negligible quality degradation.
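To make the cross-layer idea concrete, the following is a minimal PyTorch-style sketch of pruning a higher layer's KV cache with token importance estimated at a lower layer. The function name, tensor layout, scoring convention, and keep ratio are illustrative assumptions rather than PureKV's actual implementation; the point is that a per-token score obtained at a lower layer can drive top-k pruning of a higher layer's cache without that layer ever materializing a full attention matrix.

```python
import torch

def prune_kv_with_lower_layer_scores(keys, values, lower_layer_scores, keep_ratio=0.2):
    """Prune a higher layer's KV cache using token importance estimated at a lower layer.

    keys, values:        [batch, heads, seq_len, head_dim] cache tensors of a higher layer.
    lower_layer_scores:  [batch, seq_len] per-token importance, e.g. the attention mass each
                         token received at a lower layer, averaged over heads and queries.
    Returns the cache restricted to the top `keep_ratio` fraction of tokens.
    (Hypothetical helper for illustration; not the paper's code.)
    """
    batch, heads, seq_len, head_dim = keys.shape
    num_keep = max(1, int(seq_len * keep_ratio))
    # Indices of the most important tokens, kept in their original temporal order.
    keep_idx = lower_layer_scores.topk(num_keep, dim=-1).indices.sort(dim=-1).values  # [batch, num_keep]
    idx = keep_idx[:, None, :, None].expand(batch, heads, num_keep, head_dim)
    return keys.gather(2, idx), values.gather(2, idx)
```

With keep_ratio=0.2 this yields roughly the 5× cache reduction quoted above, although the actual ratio and the scoring rule used by PureKV may differ.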
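Likewise, a spatial-temporal sparse attention pattern of the kind ST-SpAttn describes can be sketched as a boolean mask over video tokens. This is an illustration under assumed conventions (frame-major token ordering and a hypothetical temporal_window parameter), not the module's actual sparsity pattern: each query attends densely within its own frame (spatial) and to the token at the same spatial position in a few preceding frames (temporal).

```python
import torch

def build_st_sparse_mask(num_frames, tokens_per_frame, temporal_window=2):
    """Boolean attention mask of shape [N, N], with N = num_frames * tokens_per_frame.

    True entries are attended. A query token attends to
      (i)  every token in its own frame (spatial locality), and
      (ii) the token at the same spatial position in the previous `temporal_window` frames.
    (Assumed sparsity pattern for illustration; parameters are hypothetical.)
    """
    N = num_frames * tokens_per_frame
    mask = torch.zeros(N, N, dtype=torch.bool)
    pos = torch.arange(tokens_per_frame)
    for f in range(num_frames):
        s = f * tokens_per_frame
        mask[s:s + tokens_per_frame, s:s + tokens_per_frame] = True   # spatial block within the frame
        for dt in range(1, temporal_window + 1):
            if f - dt >= 0:
                p = (f - dt) * tokens_per_frame
                mask[s + pos, p + pos] = True                         # temporal links to earlier frames
    return mask
```

Under such a pattern each token aggregates information only from its own frame and a short temporal history, which keeps attention cost far below the dense quadratic case and leaves a KV cache with less spatial noise and temporal redundancy for the downstream pruning step.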