Diffusion language models (dLMs) have emerged as a promising paradigm that enables parallel, non-autoregressive generation, but their learning efficiency lags behind that of autoregressive (AR) language models when trained from scratch. To address this, we study AR-to-dLM conversion, which transforms pretrained AR models into efficient dLMs that excel in generation speed while preserving the AR models' task accuracy. We achieve this by identifying limitations in the attention patterns and objectives of existing AR-to-dLM methods and then proposing principles and methodologies for more effective AR-to-dLM conversion. Specifically, we first systematically compare different attention patterns and find that maintaining pretrained AR weight distributions is critical for effective AR-to-dLM conversion. Accordingly, we introduce a continuous pretraining scheme with a block-wise attention pattern that remains causal across blocks while enabling bidirectional modeling within each block. We find that this approach preserves the pretrained AR models' weight distributions better than fully bidirectional modeling, in addition to its known benefit of enabling KV caching, and thus yields a win-win in accuracy and efficiency. Second, to mitigate the train-test gap in mask token distributions (uniform during training vs. highly left-to-right at test time), we propose a position-dependent token masking strategy that assigns higher masking probabilities to later tokens during training to better mimic test-time behavior. Leveraging this framework, we conduct extensive studies of dLMs' attention patterns, training dynamics, and other design choices, providing actionable insights into scalable AR-to-dLM conversion. These studies lead to the Efficient-DLM family, which outperforms state-of-the-art AR models and dLMs; e.g., our Efficient-DLM 8B achieves +5.4%/+2.7% higher accuracy with 4.5x/2.7x higher throughput compared to Dream 7B and Qwen3 4B, respectively.
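To make the block-wise attention pattern concrete, below is a minimal sketch of a mask that is causal across blocks but bidirectional within each block. The function name `block_causal_mask` and the `block_size` parameter are illustrative assumptions, not identifiers from the paper.

```python
import torch

def block_causal_mask(seq_len: int, block_size: int) -> torch.Tensor:
    """Boolean attention mask: position i may attend to position j
    iff j's block index <= i's block index, i.e., full (bidirectional)
    attention within a block and causal attention across blocks."""
    blocks = torch.arange(seq_len) // block_size        # block index of each position
    mask = blocks.unsqueeze(1) >= blocks.unsqueeze(0)   # (seq_len, seq_len), True = attend
    return mask

# Example: with seq_len=8 and block_size=4, tokens 0-3 attend to each other
# (bidirectional within block 0), while tokens 4-7 attend to all of 0-7.
print(block_causal_mask(8, 4).int())
```

Because attention across blocks stays causal, key/value states of completed blocks can be cached and reused during block-by-block decoding, which is the efficiency benefit noted in the abstract.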
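The position-dependent masking strategy can likewise be sketched as below. The abstract only states that later tokens receive higher masking probabilities, so the linear ramp from `p_min` to `p_max` and the function/argument names are illustrative assumptions rather than the paper's exact schedule.

```python
import torch

def position_dependent_masking(input_ids: torch.Tensor,
                               mask_token_id: int,
                               p_min: float = 0.1,
                               p_max: float = 0.9) -> tuple[torch.Tensor, torch.Tensor]:
    """Mask tokens with a probability that grows with position, so later
    tokens are masked more often, mimicking the highly left-to-right
    mask distribution seen at test time. Linear schedule is an assumption."""
    batch, seq_len = input_ids.shape
    positions = torch.arange(seq_len, dtype=torch.float32) / max(seq_len - 1, 1)
    p_mask = p_min + (p_max - p_min) * positions        # (seq_len,) per-position mask prob.
    is_masked = torch.rand(batch, seq_len) < p_mask     # Bernoulli draw per token
    noisy_ids = input_ids.masked_fill(is_masked, mask_token_id)
    return noisy_ids, is_masked
```

During training, `noisy_ids` would be fed to the dLM and the loss computed only on positions where `is_masked` is true, as in standard masked-diffusion objectives.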