In this work, we present WaveFlow, a small-footprint generative flow for raw audio, which is directly trained with maximum likelihood, without the probability density distillation or auxiliary losses used in Parallel WaveNet and ClariNet. It provides a unified view of likelihood-based models for raw audio, including WaveNet and WaveGlow as special cases. We systematically study these likelihood-based generative models for raw waveforms in terms of test likelihood and speech fidelity. We demonstrate that WaveFlow can synthesize speech with fidelity matching WaveNet, while requiring only a few sequential steps to generate very long waveforms with hundreds of thousands of time-steps. Furthermore, WaveFlow closes the significant likelihood gap that has existed between autoregressive models and flow-based models for efficient synthesis. Finally, our small-footprint WaveFlow has only 5.91M parameters and can generate 22.05kHz high-fidelity speech 42.6 times faster than real-time on a GPU without engineered inference kernels.
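The efficiency claim rests on one structural idea: squeeze the 1-D waveform of length T into a 2-D array of height h (column-major, so adjacent samples fall in the same column) and apply the autoregressive factorization only over the height dimension, so synthesis takes h sequential steps instead of T. Below is a minimal, illustrative PyTorch sketch of that idea, not the authors' implementation: the single small convolutional net is a toy stand-in for the paper's dilated 2-D convolutional architecture, and all names here (`ToyWaveFlow`, `squeeze_to_2d`) are hypothetical.

```python
# Minimal sketch (assumed, not the paper's code) of WaveFlow's core idea:
# autoregress over the height of a squeezed 2-D view of the waveform,
# so inversion needs h sequential steps regardless of the length T = h * w.
import torch
import torch.nn as nn

def squeeze_to_2d(x, h):
    """Reshape a waveform batch (B, T) into (B, 1, h, T // h), column-major in time."""
    B, T = x.shape
    assert T % h == 0
    return x.view(B, T // h, h).permute(0, 2, 1).unsqueeze(1)  # (B, 1, h, w)

class ToyWaveFlow(nn.Module):
    """One affine flow step, autoregressive over the height (row) dimension only."""
    def __init__(self, channels=32):
        super().__init__()
        # Kernel height 2 with top-only padding keeps the net causal in height;
        # width may be non-causal since autoregression is only over rows.
        self.net = nn.Sequential(
            nn.Conv2d(1, channels, kernel_size=(2, 3)),
            nn.ReLU(),
            nn.Conv2d(channels, 2, kernel_size=1),  # predicts (log_scale, shift)
        )

    def _params(self, x_prev):
        # x_prev: (B, 1, i, w) rows available so far (leading row is a zero pad).
        x_pad = nn.functional.pad(x_prev, (1, 1, 1, 0))  # width both sides, height on top
        out = self.net(x_pad)                            # (B, 2, i, w)
        return out[:, :1, -1:], out[:, 1:, -1:]          # params for the next row

    def synthesize(self, z):
        """Invert the flow: z (B, 1, h, w) -> x, one row at a time (h steps total)."""
        B, _, h, w = z.shape
        x = torch.zeros(B, 1, 1, w)  # row 0 is conditioned on a constant zero row
        rows = []
        for i in range(h):  # only h sequential steps, not T = h * w
            log_s, b = self._params(x)
            rows.append((z[:, :, i:i + 1] - b) * torch.exp(-log_s))
            x = torch.cat([torch.zeros(B, 1, 1, w)] + rows, dim=2)
        return torch.cat(rows, dim=2)

# Demo with untrained (random) weights, just to show shapes and step count:
print(squeeze_to_2d(torch.randn(2, 1024), h=16).shape)  # torch.Size([2, 1, 16, 64])
model = ToyWaveFlow()
z = torch.randn(2, 1, 16, 64)  # 1024 samples generated in 16 sequential steps
with torch.no_grad():
    print(model.synthesize(z).shape)  # torch.Size([2, 1, 16, 64])
```

In the unified view stated above, sweeping the squeeze height h trades parallelism for autoregressive structure: the fully autoregressive extreme behaves like WaveNet, while the fully parallel, bipartite-style extreme corresponds to WaveGlow.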