Non-autoregressive Transformer (NAT) is a family of text generation models that aims to reduce decoding latency by predicting whole sentences in parallel. However, this latency reduction sacrifices the ability to capture left-to-right dependencies, making NAT learning very challenging. In this paper, we present theoretical and empirical analyses that reveal the challenges of NAT learning and propose a unified perspective for understanding existing successes. First, we show that simply training a NAT by maximizing the likelihood yields an approximation of the marginal distributions but drops all dependencies between tokens, where the dropped information can be measured by the dataset's conditional total correlation. Second, we formalize many previous objectives in a unified framework and show that their success can be understood as maximizing the likelihood on a proxy distribution, which reduces the information loss. Empirical studies show that our perspective can explain phenomena in NAT learning and guide the design of new training methods.
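For reference, the conditional total correlation mentioned above can be written in its standard information-theoretic form (the notation here is ours, not necessarily the paper's): for a source sentence $X$ and a target sentence $Y = (y_1, \ldots, y_n)$,
\[
\mathcal{C}(Y \mid X) \;=\; \sum_{i=1}^{n} H(y_i \mid X) \;-\; H(Y \mid X) \;=\; \mathrm{KL}\!\left( p(Y \mid X) \,\Big\|\, \prod_{i=1}^{n} p(y_i \mid X) \right),
\]
i.e., the divergence between the true joint conditional distribution and the product of its per-token marginals. This quantity is exactly the inter-token dependency information that a fully factorized, position-wise independent NAT predictor cannot represent.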