Generative reasoning with large language models (LLMs) often involves long decoding sequences, leading to substantial memory and latency overheads from accumulating key-value (KV) caches. While existing KV compression methods primarily focus on reducing prefill memory from long input sequences, they fall short of addressing the dynamic and layer-sensitive nature of long-form generation, which is central to reasoning tasks. We propose Lethe, a dynamic KV cache management framework that introduces adaptivity along both the spatial and temporal dimensions of decoding. Along the spatial dimension, Lethe performs layerwise sparsity-aware allocation, assigning token pruning budgets to each transformer layer based on estimated attention redundancy. Along the temporal dimension, Lethe conducts multi-round token pruning during generation, driven by a Recency-Aware Selective Retention (RASR) mechanism. RASR extends traditional recency-based heuristics by also considering token relevance derived from evolving attention patterns, enabling informed decisions about which tokens to retain or evict. Empirical results demonstrate that Lethe achieves a favorable balance between efficiency and generation quality across diverse models and tasks, increasing throughput by up to 2.56x.
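To make the retention criterion concrete, the following is a minimal illustrative sketch (not the paper's implementation) of a RASR-style score that combines recency with attention-derived relevance and prunes one layer's KV cache to a given budget. All names (`alpha`, `attn_scores`, `rasr_retention_scores`, `prune_kv_cache`) and the specific mixing formula are hypothetical assumptions for illustration only.

```python
import numpy as np

def rasr_retention_scores(attn_scores: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Combine recency and relevance into one retention score per cached token.

    attn_scores: attention mass each cached token has received during recent
    decoding steps, shape (num_tokens,). Higher returned score means 'keep'.
    (Hypothetical scoring; the paper's exact formulation may differ.)
    """
    num_tokens = attn_scores.shape[0]
    # Recency term: later positions (more recently generated) score higher.
    recency = np.arange(1, num_tokens + 1) / num_tokens
    # Relevance term: normalize accumulated attention to [0, 1].
    relevance = attn_scores / (attn_scores.max() + 1e-8)
    return alpha * recency + (1.0 - alpha) * relevance

def prune_kv_cache(attn_scores: np.ndarray, budget: int) -> np.ndarray:
    """Keep the `budget` highest-scoring token positions for one layer."""
    scores = rasr_retention_scores(attn_scores)
    keep = np.argsort(scores)[-budget:]
    return np.sort(keep)  # sorted indices of tokens retained in the KV cache

# Example: prune a 10-token cache down to a per-layer budget of 4 tokens.
attn = np.array([0.9, 0.1, 0.05, 0.4, 0.02, 0.3, 0.01, 0.2, 0.6, 0.5])
print(prune_kv_cache(attn, budget=4))
```

In the full framework, `budget` would itself vary across layers according to the layerwise sparsity-aware allocation, with more redundant layers receiving smaller budgets.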