The Key-Value (KV) cache is central to the efficiency of transformer-based large language models (LLMs), storing previously computed key and value vectors to accelerate inference. Yet, as sequence length and batch size grow, the cache becomes a major memory bottleneck. Prior compression methods typically apply low-rank decomposition to keys alone or attempt to jointly embed queries and keys, but both approaches neglect that attention fundamentally depends on the query-key inner products. In this work, we prove that such strategies are suboptimal for approximating the attention matrix. We introduce KQ-SVD, a simple and computationally efficient method that directly performs an optimal low-rank decomposition of the attention matrix via a closed-form solution. By targeting the true source of redundancy, KQ-SVD preserves attention outputs with higher fidelity under compression. Extensive evaluations on LLaMA and Mistral models demonstrate that our approach consistently delivers superior projection quality.
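The claim that compressing keys alone is suboptimal rests on a standard linear-algebra fact: the truncated SVD of the attention-score matrix itself is, by Eckart-Young, the best rank-r approximation in Frobenius norm, whereas any rank-r factorization of K alone can only do worse or equal. The sketch below illustrates this point on random data; it is not the paper's KQ-SVD implementation, and all shapes, the target rank, and the synthetic Q and K are illustrative assumptions.

```python
# Minimal sketch (not the paper's code): compare two rank-r approximations
# of the attention logits S = Q K^T / sqrt(d).
#   (a) compress K alone via its truncated SVD, then recompute Q K_r^T;
#   (b) truncate the SVD of S itself (Frobenius-optimal by Eckart-Young).
import numpy as np

rng = np.random.default_rng(0)
n, d, r = 256, 64, 16            # sequence length, head dim, target rank (assumed)

Q = rng.standard_normal((n, d))
K = rng.standard_normal((n, d))
S = Q @ K.T / np.sqrt(d)         # exact attention logits

# (a) low-rank keys only: K ~= U_r diag(s_r) V_r^T
Uk, sk, Vtk = np.linalg.svd(K, full_matrices=False)
K_r = (Uk[:, :r] * sk[:r]) @ Vtk[:r]
S_keys_only = Q @ K_r.T / np.sqrt(d)

# (b) optimal rank-r approximation of S directly
Us, ss, Vts = np.linalg.svd(S, full_matrices=False)
S_opt = (Us[:, :r] * ss[:r]) @ Vts[:r]

rel_err = lambda A: np.linalg.norm(S - A) / np.linalg.norm(S)
print(f"keys-only rank-{r} error: {rel_err(S_keys_only):.4f}")
print(f"optimal   rank-{r} error: {rel_err(S_opt):.4f}")  # never larger than (a)
```

Since Q K_r^T also has rank at most r, its Frobenius error can never fall below that of the truncated SVD of S, which is the sense in which key-only compression is provably suboptimal for approximating the attention matrix.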