DeepSeek-V3.2-Exp introduces a sparse attention mechanism that significantly reduces inference latency in long-context scenarios. Although overall throughput has improved substantially, the Decode stage of prefill/decode (PD) disaggregation remains a major bottleneck. This bottleneck stems primarily from the conflict between the linear growth of the Latent-Cache with sequence length and the limited GPU memory capacity, which constrains the feasible batch size and thereby suppresses Decode-stage throughput. To address this challenge, we propose ESS (Extended Sparse Server), an offload-centric system design tailored for DeepSeek-V3.2-Exp. ESS selectively offloads the Latent-Cache to CPU memory while keeping latency-critical components on the GPU. By freeing GPU memory, ESS effectively decouples batch-size scaling from GPU memory constraints. This design significantly improves Decode-stage throughput and thereby reduces deployment costs in real-world settings. Our high-fidelity simulations show that ESS delivers a 69.4\% throughput improvement at a 32K context length and up to a 123\% improvement at 128K, demonstrating its effectiveness for long-context inference workloads. These results highlight ESS as a practical and scalable solution for long-context LLM serving.
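To make the memory constraint concrete, the feasible decode batch size can be bounded by a simple accounting identity (an illustrative sketch; the symbols $M_{\text{GPU}}$, $M_{\text{weights}}$, $M_{\text{act}}$, $c$, and $L$ are our notation, not taken from the system itself):
\[
B_{\max} \approx \frac{M_{\text{GPU}} - M_{\text{weights}} - M_{\text{act}}}{c \cdot L},
\]
where $M_{\text{GPU}}$ is total device memory, $M_{\text{weights}}$ and $M_{\text{act}}$ denote memory held by model weights and activations, $c$ is the per-token Latent-Cache footprint, and $L$ is the context length. Because the denominator grows linearly in $L$, $B_{\max}$ shrinks as contexts lengthen; offloading the Latent-Cache to CPU memory removes the $c \cdot L$ term from the GPU budget, which is what allows batch-size scaling to become independent of context length.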