Long-context inference for Large Language Models (LLMs) is heavily constrained by high computational demands. While several existing methods optimize attention computation, they still process the full set of hidden states at each layer, limiting overall efficiency. In this work, we propose SlimInfer, a framework that accelerates inference by directly pruning less critical prompt tokens during the forward pass. Our key insight is an information diffusion phenomenon: as information from critical tokens propagates through layers, it becomes distributed across the entire sequence. This diffusion suggests that LLMs can maintain semantic integrity even when a substantial number of hidden-state tokens, including some of these critical ones, are pruned. Motivated by this, SlimInfer introduces a dynamic fine-grained pruning mechanism that accurately removes redundant hidden-state tokens at intermediate layers. This layer-wise pruning naturally enables an asynchronous KV cache manager that prefetches the required token blocks without complex predictors, reducing both memory usage and I/O costs. Extensive experiments show that SlimInfer achieves up to $\mathbf{2.53\times}$ time-to-first-token (TTFT) speedup and $\mathbf{1.88\times}$ end-to-end latency reduction for LLaMA3.1-8B-Instruct on a single RTX 4090, without sacrificing performance on LongBench. Our code is available at https://github.com/Longxmas/SlimInfer.
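To make the core idea concrete, the following is a minimal, hypothetical sketch of layer-wise hidden-state token pruning of the kind described above; it is not the SlimInfer implementation. It assumes token importance is estimated from the current layer's attention weights, and all names (`prune_hidden_states`, `keep_ratio`) are illustrative only.

```python
# Minimal sketch (not the authors' implementation) of layer-wise hidden-state
# token pruning. Assumes importance is estimated from attention weights; all
# names and the keep_ratio heuristic are hypothetical.
import torch

def prune_hidden_states(hidden_states: torch.Tensor,
                        attn_weights: torch.Tensor,
                        keep_ratio: float = 0.5):
    """Keep the most-attended prompt tokens at an intermediate layer.

    hidden_states: [batch, seq_len, hidden_dim]
    attn_weights:  [batch, num_heads, seq_len, seq_len] from the current layer
    Returns the pruned hidden states and the indices of the kept tokens, which
    a KV cache manager could use to prefetch only the required token blocks.
    """
    bsz, seq_len, hidden_dim = hidden_states.shape
    num_keep = max(1, int(seq_len * keep_ratio))

    # Importance of each key token = attention it receives, averaged over
    # heads and query positions (one simple proxy among many possible ones).
    importance = attn_weights.mean(dim=(1, 2))  # [batch, seq_len]

    # Select the top-k tokens and restore their original order so positional
    # information stays consistent for the remaining layers.
    keep_idx = importance.topk(num_keep, dim=-1).indices.sort(dim=-1).values
    pruned = torch.gather(
        hidden_states, 1,
        keep_idx.unsqueeze(-1).expand(bsz, num_keep, hidden_dim))
    return pruned, keep_idx
```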