DeepSeek-V3.2-Exp introduces a sparse attention mechanism that significantly reduces inference latency in long-context scenarios. Although overall throughput has improved substantially, the Decode stage of prefill/decode (PD) disaggregation remains a major bottleneck. This bottleneck stems primarily from the conflict between the linear growth of the Latent-Cache with sequence length and the limited GPU memory capacity, which constrains the feasible batch size and thereby suppresses Decode-stage throughput. To address this challenge, we propose ESS (Extended Sparse Server), an offload-centric system design tailored for DeepSeek-V3.2-Exp. ESS selectively offloads the Latent-Cache to CPU memory while keeping latency-critical components on the GPU. By freeing GPU memory, ESS effectively decouples batch-size scaling from GPU memory constraints. This design significantly improves Decode-stage throughput and thereby reduces deployment costs in real-world settings. Our high-fidelity simulations show that ESS delivers a 69.4\% throughput improvement at a 32K context length and up to a 123\% improvement at 128K, demonstrating its effectiveness for long-context inference workloads. These results highlight ESS as a practical and scalable solution for long-context LLM serving.
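To make the memory constraint concrete, the feasible decode batch size can be bounded by a simple accounting identity (an illustrative sketch; the symbols $M_{\text{GPU}}$, $M_{\text{weights}}$, $M_{\text{act}}$, $c$, and $L$ are our notation, not taken from the system itself):
\[
B_{\max} \approx \frac{M_{\text{GPU}} - M_{\text{weights}} - M_{\text{act}}}{c \cdot L},
\]
where $M_{\text{GPU}}$ is total device memory, $M_{\text{weights}}$ and $M_{\text{act}}$ denote memory held by model weights and activations, $c$ is the per-token Latent-Cache footprint, and $L$ is the context length. Because the denominator grows linearly in $L$, $B_{\max}$ shrinks as contexts lengthen; offloading the Latent-Cache to CPU memory removes the $c \cdot L$ term from the GPU budget, which is what allows batch-size scaling to become independent of context length.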