Speculative decoding has become a standard way to accelerate LLM inference: a small drafter proposes multiple tokens and a large target model verifies them once per speculation length. Recent scaling of LLM vocabularies has substantially increased the number of tokens. While verification over the full vocabulary leaves the target model largely unaffected, the O(|V|d) parameters in the drafter's output head become a latency bottleneck, slowing the entire pipeline. Contemporary methods (e.g., FR-Spec, VocabTrim) restrict the drafter's vocabulary to a fixed subset of the target model's most frequent tokens. Although this reduces draft-time compute, it is brittle: (i) frequency lists are corpus-dependent and require retuning to generalize, and (ii) static shortlists suppress rare or domain-specific tokens, lowering the expected number of accepted tokens per verification step. We propose DynaSpec, a context-dependent dynamic shortlisting mechanism that is robust, speeds up drafting, and generalizes across diverse tasks. Concretely, we introduce lightweight, coarse-grained meta-classifiers that route contexts to a small number of token clusters; the union of the top-k selected clusters forms the drafter's shortlist, while verification retains the full vocabulary and exactness. The meta-classifier finishes its computation before the drafter's hidden-state generation by executing draft encoding and meta shortlisting in parallel on separate streams. Across standard speculative decoding benchmarks, DynaSpec delivers consistent improvements in mean accepted length: for Llama-3-8B, it reaches up to 98.2% of full-vocabulary performance, while fixed-shortlist baselines attain only 84.4%. By leveraging context-dependent selection, DynaSpec achieves up to a 2.18× increase in generated tokens, compared to 1.91× for fixed-vocabulary approaches.
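The following is a minimal sketch of the dynamic shortlisting idea described above: a tiny meta-classifier scores coarse token clusters from the context, the union of the top-k clusters forms the drafter's shortlist, and the drafter's output head is evaluated only over that shortlist. All names (MetaClassifier, cluster_to_tokens, top_k_clusters) are illustrative assumptions, not the paper's actual implementation; verification on the target model is unchanged and still uses the full vocabulary.

```python
import torch
import torch.nn as nn


class MetaClassifier(nn.Module):
    """Lightweight head that scores coarse token clusters from a context vector.
    Its cost is O(num_clusters * d), avoiding the O(|V| * d) drafter output head."""

    def __init__(self, hidden_dim: int, num_clusters: int):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, num_clusters)

    def forward(self, context_hidden: torch.Tensor) -> torch.Tensor:
        # Returns one score per cluster for the current context.
        return self.proj(context_hidden)


def build_shortlist(cluster_scores: torch.Tensor,
                    cluster_to_tokens: list[torch.Tensor],
                    top_k_clusters: int) -> torch.Tensor:
    """Union of token ids from the top-k scored clusters forms the drafter shortlist."""
    top_clusters = cluster_scores.topk(top_k_clusters).indices.tolist()
    shortlist = torch.cat([cluster_to_tokens[c] for c in top_clusters])
    return torch.unique(shortlist)


def draft_logits_on_shortlist(drafter_hidden: torch.Tensor,
                              output_head_weight: torch.Tensor,
                              shortlist: torch.Tensor) -> torch.Tensor:
    """Drafter output head restricted to the shortlist (|S| << |V|);
    the target model's verification step still scores the full vocabulary."""
    sub_head = output_head_weight[shortlist]      # [|S|, d]
    return drafter_hidden @ sub_head.T            # logits over the shortlist only
```

In a hypothetical usage, the meta-classifier and the drafter's encoding can be launched on separate CUDA streams, so the shortlist is ready by the time the drafter's hidden state is produced, as the abstract describes.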