AdaSPEC：面向高效推测解码器的选择性知识蒸馏 (AdaSPEC: Selective Knowledge Distillation for Efficient Speculative Decoders)

Speculative Decoding (SD) accelerates large language model inference by employing a small draft model to generate predictions, which are then verified by a larger target model. The effectiveness of SD hinges on the alignment between these models, which is typically enhanced by Knowledge Distillation (KD). However, conventional KD methods aim to minimize the KL divergence between the draft and target models across all tokens, a goal that is misaligned with the true objective of SD, which is to maximize token acceptance rate. Therefore, draft models often struggle to fully assimilate the target model's knowledge due to capacity constraints, leading to suboptimal performance. To address this challenge, we propose AdaSPEC, a novel method that incorporates selective token filtering into the KD process. AdaSPEC utilizes a reference model to identify and filter out difficult-to-fit tokens, enabling the distillation of a draft model that better aligns with the target model on simpler tokens. This approach improves the overall token acceptance rate without compromising generation quality. We evaluate AdaSPEC across diverse tasks, including arithmetic reasoning, instruction-following, coding, and summarization, using model configurations of 31M/1.4B and 350M/2.7B parameters. Our results demonstrate that AdaSPEC consistently outperforms the state-of-the-art DistillSpec method, achieving higher acceptance rates across all tasks (up to 15\%). The code is publicly available at https://github.com/yuezhouhu/adaspec.

翻译：推测解码（Speculative Decoding，SD）通过采用一个小型草稿模型生成预测，再由一个更大的目标模型进行验证，从而加速大语言模型的推理过程。SD的有效性取决于这两个模型之间的对齐程度，通常通过知识蒸馏（Knowledge Distillation，KD）来增强。然而，传统的KD方法旨在最小化草稿模型与目标模型在所有词元上的KL散度，这一目标与SD的真实目标——最大化词元接受率——并不一致。因此，由于容量限制，草稿模型往往难以完全吸收目标模型的知识，导致性能欠佳。为解决这一挑战，我们提出了AdaSPEC，一种在KD过程中引入选择性词元过滤的新方法。AdaSPEC利用一个参考模型来识别并过滤难以拟合的词元，从而能够在更简单的词元上蒸馏出与目标模型对齐更好的草稿模型。该方法在不损害生成质量的前提下，提高了整体词元接受率。我们在包括算术推理、指令遵循、代码生成和文本摘要在内的多样化任务上评估了AdaSPEC，使用的模型参数配置为31M/1.4B和350M/2.7B。实验结果表明，AdaSPEC在所有任务上均持续优于当前最先进的DistillSpec方法，实现了更高的接受率（最高提升15%）。代码已在 https://github.com/yuezhouhu/adaspec 公开。

相关内容

MoDELS

关注 44

ACM/IEEE第23届模型驱动工程语言和系统国际会议，是模型驱动软件和系统工程的首要会议系列，由ACM-SIGSOFT和IEEE-TCSE支持组织。自1998年以来，模型涵盖了建模的各个方面，从语言和方法到工具和应用程序。模特的参加者来自不同的背景，包括研究人员、学者、工程师和工业专业人士。MODELS 2019是一个论坛，参与者可以围绕建模和模型驱动的软件和系统交流前沿研究成果和创新实践经验。今年的版本将为建模社区提供进一步推进建模基础的机会，并在网络物理系统、嵌入式系统、社会技术系统、云计算、大数据、机器学习、安全、开源等新兴领域提出建模的创新应用以及可持续性。官网链接：http://www.modelsconference.org/

FlowQA: Grasping Flow in History for Conversational Machine Comprehension

专知会员服务

34+阅读 · 2019年10月18日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日

Connections between Support Vector Machines, Wasserstein distance and gradient-penalty GANs

专知会员服务

36+阅读 · 2019年10月17日

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

专知会员服务

59+阅读 · 2019年10月17日