Speculative decoding has emerged as a popular method to accelerate the inference of Large Language Models (LLMs) while retaining their superior text generation performance. Previous methods either adopt a fixed speculative decoding configuration regardless of the prefix tokens, or train draft models in an offline or online manner to align them with the context. This paper proposes a training-free online learning framework that adaptively chooses the hyperparameter configuration for speculative decoding as text is being generated. We first formulate this hyperparameter selection problem as a Multi-Armed Bandit problem and provide a general speculative decoding framework, BanditSpec. Furthermore, two bandit-based hyperparameter selection algorithms, UCBSpec and EXP3Spec, are designed and analyzed in terms of a novel quantity, the stopping time regret. We upper bound this regret under both stochastic and adversarial reward settings. By deriving an information-theoretic impossibility result, we show that the regret of UCBSpec is optimal up to universal constants. Finally, extensive empirical experiments with LLaMA3 and Qwen2 demonstrate that our algorithms are effective compared to existing methods, and that their throughput is close to that of the oracle best hyperparameter in simulated real-life LLM serving scenarios with diverse input prompts.
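To make the bandit formulation concrete, here is a minimal sketch of a UCB-style selection loop over candidate speculative-decoding configurations (here, draft lengths), treating the number of tokens accepted in each draft-then-verify round as the reward and the number of rounds needed to finish generation as the stopping time. The function names, the toy acceptance model, and the exact reward definition are illustrative assumptions for exposition, not the UCBSpec algorithm itself.

```python
import math
import random

def ucb_select(counts, means, t, c=2.0):
    """Pick the arm (hyperparameter configuration) with the highest UCB index."""
    for k, n in enumerate(counts):
        if n == 0:
            return k  # play each configuration once before using the index
    return max(
        range(len(counts)),
        key=lambda k: means[k] + math.sqrt(c * math.log(t) / counts[k]),
    )

def run_ucb_spec_sketch(configs, speculate_step, total_tokens):
    """Adaptively choose a speculative-decoding configuration per round.

    configs:        candidate hyperparameter settings, e.g. draft lengths.
    speculate_step: callable(config) -> number of tokens produced this round
                    (hypothetical stand-in for one draft-then-verify round).
    total_tokens:   stop once this many tokens have been generated.
    """
    K = len(configs)
    counts = [0] * K
    means = [0.0] * K
    generated, t = 0, 0
    while generated < total_tokens:
        t += 1
        k = ucb_select(counts, means, t)
        accepted = speculate_step(configs[k])  # reward = accepted tokens
        counts[k] += 1
        means[k] += (accepted - means[k]) / counts[k]  # running mean update
        generated += accepted
    return t  # stopping time: number of decoding rounds (smaller is better)

# Toy usage: arm = draft length; each drafted token is accepted i.i.d.
if __name__ == "__main__":
    random.seed(0)
    def toy_step(draft_len, p_accept=0.7):
        accepted = 0
        for _ in range(draft_len):
            if random.random() < p_accept:
                accepted += 1
            else:
                break  # first rejection truncates the draft
        return accepted + 1  # verification always yields at least one token
    rounds = run_ucb_spec_sketch([1, 2, 4, 8], toy_step, total_tokens=500)
    print(f"finished in {rounds} rounds")
```

Because a round with a well-chosen draft length accepts more tokens, the loop that maximizes per-round reward also minimizes the number of rounds, which is why regret here is naturally measured against the stopping time of the oracle best configuration.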