As large language models (LLMs) scale out with tensor parallelism (TP) and pipeline parallelism (PP), and as production stacks aggressively optimize the data plane (attention/GEMM kernels and the KV cache), sampling, the decision plane that turns logits into tokens, emerges as a new bottleneck. This creates a structural holdout: sampling neither scales with TP nor balances across PP stages, so its share of iteration time grows as GPUs get faster, and it caps pipeline frequency at the last stage. We present SIMPLE, a stage-agnostic, sequence-parallel, overlappable decision plane that disaggregates sampling into a CPU-side service and shrinks its runtime footprint back to a minor, hidden role. SIMPLE combines: (1) sequence-parallel sampling, which shards work along the batch dimension and eliminates vocabulary-axis collectives; (2) a CPU-based algorithm with column-wise penalties and truncation-first filtering that yields single-pass, linear-time kernels; and (3) speculative hot-vocab sampling (SHVS), which samples from a small hot set with rejection-based correctness and uses a simple sizing model to choose the hot-vocab size that maximizes throughput. In our evaluation, SIMPLE improves end-to-end throughput by up to 96% and reduces P95 latency by 20-65%. Crucially, SIMPLE requires no user-side code changes and composes with existing data-plane optimizations, unlocking scaling benefits that compound with future GPU generations.
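To make (1) concrete, the following is a minimal numpy sketch of batch-dimension sharding, with a rank loop standing in for actual TP workers in a single process. The function names, the simulated all-gather, and the single-process setup are illustrative assumptions, not SIMPLE's implementation; the point is only that each rank samples its own batch slice locally, so the only cross-rank exchange is B integer token ids rather than a collective over B×V logits.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def sequence_parallel_sample(logits, n_ranks, rng):
    """Shard sampling along the batch axis across n_ranks workers (sketch).

    logits: (B, V) final-layer logits. Each simulated rank keeps only a
    B/n_ranks slice of rows and samples those rows independently, so no
    collective ever runs along the vocabulary axis; a real deployment
    would finish with an all-gather of B token ids.
    """
    B, V = logits.shape
    tokens = np.empty(B, dtype=np.int64)
    for rows in np.array_split(np.arange(B), n_ranks):  # one iteration = one rank
        probs = softmax(logits[rows])
        for i, r in enumerate(rows):
            tokens[r] = rng.choice(V, p=probs[i])       # purely local decision
    return tokens                                       # stands in for the id all-gather

rng = np.random.default_rng(0)
logits = rng.standard_normal((8, 32000))
print(sequence_parallel_sample(logits, n_ranks=4, rng=rng))
```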
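The CPU path in (2) can be sketched under one plausible reading of the abstract's terms: "column-wise penalties" touch only the vocabulary columns that actually occur in the sequence history, and "truncation-first filtering" runs a linear-time top-k selection before softmax and top-p so every later step works on k entries instead of V. The function name, parameter defaults, and the multiplicative penalty rule below are assumptions for illustration.

```python
import numpy as np

def cpu_sample_row(logits, seen_ids, rep_penalty=1.2, top_k=64, top_p=0.9, rng=None):
    """Single-pass, linear-time CPU sampling for one sequence (sketch)."""
    logits = logits.astype(np.float64, copy=True)
    if len(seen_ids):
        cols = np.fromiter(seen_ids, dtype=np.int64)   # penalize seen columns only: O(|seen|)
        v = logits[cols]
        logits[cols] = np.where(v > 0, v / rep_penalty, v * rep_penalty)
    keep = np.argpartition(logits, -top_k)[-top_k:]    # truncation first: linear-time top-k
    z = logits[keep] - logits[keep].max()
    p = np.exp(z)
    p /= p.sum()
    order = np.argsort(-p)                             # sort only k entries, never V
    cut = int(np.searchsorted(np.cumsum(p[order]), top_p)) + 1
    nucleus = order[:cut]                              # smallest prefix with mass >= top_p
    q = p[nucleus] / p[nucleus].sum()
    return int(keep[nucleus[rng.choice(len(nucleus), p=q)]])

rng = np.random.default_rng(0)
row = rng.standard_normal(32000)
print(cpu_sample_row(row, seen_ids=[1, 5, 9], rng=rng))
```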
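For (3), the rejection-based correctness claim admits a standard construction: for every token in the hot set H, the ratio of the full softmax to the hot-set softmax is the constant Z_H/Z, so accepting a hot-set draft with that probability and otherwise sampling the residual (tail) distribution reproduces the full-vocab softmax exactly. The sketch below shows that rule plus a toy version of the sizing model; SIMPLE's actual acceptance test, hot-set construction, and cost model are not specified in the abstract, so every name and cost curve here is an assumption, and the reference code still computes the full normalizer Z for clarity even though avoiding that pass on the hot path is the whole point.

```python
import numpy as np

def shvs_sample(logits, hot_ids, rng):
    """Speculative hot-vocab sampling with a rejection-style correction (sketch)."""
    m = logits.max()
    w = np.exp(logits - m)          # unnormalized probabilities, shared max
    z_hot = w[hot_ids].sum()
    z = w.sum()                     # shown for clarity; the hot path avoids this pass
    if rng.random() < z_hot / z:    # accept: the hot-set draft is already exact
        q = w[hot_ids] / z_hot
        return int(hot_ids[rng.choice(len(hot_ids), p=q)])
    tail = np.setdiff1d(np.arange(logits.shape[0]), hot_ids)
    q = w[tail] / (z - z_hot)       # residual distribution over the tail
    return int(tail[rng.choice(len(tail), p=q)])

def choose_hot_size(sizes, hit_rate, c_hot, c_full):
    """Toy sizing model: expected per-token cost is the hot-path cost plus
    the full-vocab fallback cost weighted by the miss probability."""
    return min(sizes, key=lambda h: c_hot(h) + (1.0 - hit_rate(h)) * c_full)

rng = np.random.default_rng(0)
logits = rng.standard_normal(32000)
hot = np.argsort(-logits)[:512]     # stand-in for a frequency-derived hot set
print(shvs_sample(logits, hot, rng))
```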