We propose Parallel Token Prediction (PTP), a universal framework for parallel sequence generation in language models. PTP jointly predicts multiple dependent tokens in a single transformer call by incorporating the sampling procedure into the model itself. This reduces the latency bottleneck of autoregressive decoding and avoids the restrictive independence assumptions common in existing multi-token prediction methods. We prove that PTP can represent arbitrary autoregressive sequence distributions. PTP can be trained either by distilling an existing model or, without a teacher, through inverse autoregressive training. Experimentally, we achieve state-of-the-art speculative decoding performance on Vicuna-7B, accepting more than four tokens per step on Spec-Bench. The universality of our framework indicates that parallel generation of long sequences is feasible without loss of modeling power.