Speculative decoding is a prominent technique for accelerating the autoregressive inference of large language models (LLMs): a fast draft model proposes candidate token sequences, and a larger target model verifies them in parallel. However, its core component -- the rejection sampling mechanism -- relies on a fixed, context-independent random threshold. This leads to a significant "random rejection" problem in high-uncertainty generation scenarios, where plausible candidate tokens are frequently rejected by chance alone, undermining inference efficiency. This paper introduces Efficient Adaptive Rejection Sampling (EARS), a novel method that dynamically adjusts the acceptance threshold by incorporating the target model's own predictive uncertainty, measured as 1 - max(P_target). By introducing a tolerance term proportional to this uncertainty, EARS relaxes the acceptance criterion when the model is uncertain, effectively reducing random rejections, while maintaining strict standards when the model is confident. Experiments on creative writing and open-domain QA tasks demonstrate that EARS significantly enhances the efficiency of speculative decoding, achieving up to an 18.12% increase in throughput with a negligible 0.84% accuracy drop on the GSM8K benchmark. The method requires no modifications to model architectures and can be seamlessly integrated into existing speculative decoding frameworks.
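The acceptance rule described above can be sketched as follows. In standard speculative decoding, a drafted token x is accepted with probability min(1, P_target(x) / P_draft(x)); EARS adds a tolerance proportional to the target model's uncertainty, 1 - max(P_target). The additive combination and the coefficient `alpha` below are illustrative assumptions, not the paper's published formula:

```python
import numpy as np

def ears_accept(p_target, p_draft, token, alpha=0.5, rng=None):
    """Sketch of EARS-style adaptive rejection sampling (assumed form).

    p_target, p_draft: probability vectors over the vocabulary from the
    target and draft models; token: index of the drafted token.
    """
    rng = rng or np.random.default_rng()
    # Standard speculative-decoding acceptance ratio.
    ratio = p_target[token] / p_draft[token]
    # Target-model uncertainty: 1 - max(P_target); near 0 when confident.
    uncertainty = 1.0 - np.max(p_target)
    # Relax the acceptance threshold by a tolerance proportional to
    # uncertainty (alpha is a hypothetical tuning coefficient).
    threshold = min(1.0, ratio + alpha * uncertainty)
    return rng.random() < threshold

# When the target model is confident, the tolerance vanishes and the
# rule reduces to ordinary rejection sampling.
```

Note that when `p_target` is sharply peaked, `uncertainty` is close to zero and the criterion is essentially the standard one, so the strict behavior in confident contexts is preserved by construction.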