The scalability of large language models for long-context reasoning is severely constrained by the linear growth of the Transformer key-value (KV) cache, which incurs significant memory and computational costs. We posit that as a model generates reasoning tokens, the informational value of past generated tokens diminishes, creating an opportunity for compression. In this work, we propose to periodically compress the generation KV cache with a learned, special-purpose token and evict the compressed entries. We train the model to perform this compression via a modified joint distillation and reinforcement learning (RL) framework. Our training method adds minimal overhead to the conventional RL process, as it leverages the RL outputs for distillation. Empirically, our method achieves a superior memory-accuracy Pareto frontier compared to both the model without cache compression and training-free compression techniques.
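To make the periodic compress-and-evict idea concrete, the following is a minimal sketch of the decoding loop, under assumptions not specified in the abstract: every `compress_every` generated tokens, a reserved compression token (`COMPRESS_TOKEN_ID`) is appended so its KV entry can summarize the current window, and the window's entries are then evicted. The `DummyModel`, `forward_kv`, and `sample_next` names are illustrative stand-ins, not the paper's actual interfaces.

```python
import random
from dataclasses import dataclass, field
from typing import List, Tuple

# One cached (key, value) vector pair per position; placeholder representation.
KVEntry = Tuple[List[float], List[float]]

@dataclass
class KVCache:
    entries: List[KVEntry] = field(default_factory=list)

    def append(self, kv: KVEntry) -> None:
        self.entries.append(kv)

    def evict_range(self, start: int, end: int) -> None:
        # Drop the entries whose information the compression token now summarizes.
        del self.entries[start:end]

    def __len__(self) -> int:
        return len(self.entries)

class DummyModel:
    """Stand-in for the real LM: emits random tokens and random KV vectors."""
    def __init__(self, vocab_size: int = 100, dim: int = 4):
        self.vocab_size, self.dim = vocab_size, dim

    def forward_kv(self, token_id: int, cache: KVCache) -> KVEntry:
        # A real model would attend over `cache`; here we just emit placeholders.
        return ([random.random() for _ in range(self.dim)],
                [random.random() for _ in range(self.dim)])

    def sample_next(self, cache: KVCache) -> int:
        return random.randrange(1, self.vocab_size)

COMPRESS_TOKEN_ID = 0  # assumed id reserved for the learned compression token

def generate_with_compression(model, prompt_ids, compress_every=8, max_new_tokens=32):
    cache = KVCache()
    for tok in prompt_ids:                 # prefill: prompt KV kept uncompressed in this sketch
        cache.append(model.forward_kv(tok, cache))
    window_start = len(cache)              # first index of the current uncompressed window
    output_ids = []

    for _ in range(max_new_tokens):
        next_id = model.sample_next(cache)
        output_ids.append(next_id)
        cache.append(model.forward_kv(next_id, cache))

        if len(output_ids) % compress_every == 0:
            # Compute the compression token's KV entry while the window is still visible,
            # then evict the window and keep only the summary entry.
            summary_kv = model.forward_kv(COMPRESS_TOKEN_ID, cache)
            cache.evict_range(window_start, len(cache))
            cache.append(summary_kv)
            window_start = len(cache)      # next window starts after the summary

    return output_ids, len(cache)

if __name__ == "__main__":
    ids, cache_len = generate_with_compression(DummyModel(), prompt_ids=[5, 6, 7])
    print(f"generated {len(ids)} tokens, cache holds {cache_len} entries")
```

Under these settings the cache grows by at most one window plus one summary entry per compression period, rather than linearly in the number of generated tokens; the training of the compression token itself (the distillation-plus-RL objective) is outside the scope of this sketch.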