Diffusion models have demonstrated strong potential in language modeling, offering several advantages over traditional autoregressive approaches. Their ability to generate and revise entire responses in parallel enables faster generation and built-in self-correction mechanisms. Most modern diffusion-based language models employ masked diffusion, where decoding involves iteratively processing masked tokens based on a binary decision: either retaining the mask or replacing it with the predicted token. However, this binary choice discards valuable predictive information whenever the mask is retained. To address this limitation, we introduce soft-masking (SM), a novel method that, for each retained mask, dynamically blends the embedding of the mask token with the embeddings of the top-$k$ predicted tokens from the previous decoding step. This provides the model with a more informative prior, preserving context from earlier computations and allowing partial information about masked tokens to propagate beyond a single step. We propose a training methodology that adapts a pretrained masked diffusion language model to incorporate SM. We demonstrate that continued pretraining of a 169M-parameter model with SM improves perplexity and MAUVE scores. Furthermore, we finetune two state-of-the-art diffusion models, Dream-7B and Dream-Coder-7B, with SM. SM consistently improves performance across multiple coding benchmarks, particularly in high-throughput settings.
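The abstract describes the blending only at a high level; the PyTorch sketch below illustrates one plausible reading, in which each retained mask position receives a mix of the [MASK] embedding and a probability-weighted average of the top-$k$ predicted token embeddings from the previous decoding step. The function name `soft_mask_inputs`, the fixed blending weight `alpha`, and the renormalized top-$k$ mixture are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F


def soft_mask_inputs(input_emb, logits, retained_mask, embed_table, mask_emb,
                     k=8, alpha=0.5):
    """Blend the [MASK] embedding with a probability-weighted mixture of the
    top-k predicted token embeddings at every position that retains its mask.

    input_emb:     (B, L, D) input embeddings for the next decoding step
    logits:        (B, L, V) predictions from the previous decoding step
    retained_mask: (B, L) bool, True where the mask token is retained
    embed_table:   nn.Embedding mapping token ids to D-dimensional vectors
    mask_emb:      (D,) embedding of the [MASK] token
    k, alpha:      illustrative hyperparameters; the actual method may learn
                   or schedule the blend differently
    """
    probs = F.softmax(logits, dim=-1)
    topk_p, topk_ids = probs.topk(k, dim=-1)                   # (B, L, k)
    topk_p = topk_p / topk_p.sum(dim=-1, keepdim=True)         # renormalize over top-k
    topk_emb = embed_table(topk_ids)                           # (B, L, k, D)
    soft_emb = (topk_p.unsqueeze(-1) * topk_emb).sum(dim=-2)   # (B, L, D)
    blended = alpha * mask_emb + (1.0 - alpha) * soft_emb      # soft-mask embedding
    # Positions that do not retain the mask keep their current embeddings.
    return torch.where(retained_mask.unsqueeze(-1), blended, input_emb)
```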