WOLONet: Wave Outlooker for Efficient and High Fidelity Speech Synthesis

Recently, GAN-based neural vocoders such as Parallel WaveGAN, MelGAN, HiFiGAN, and UnivNet have become popular because their lightweight, parallel structure enables real-time, high-fidelity waveform synthesis, even on a CPU. HiFiGAN and UnivNet are two state-of-the-art (SOTA) vocoders. Despite their high quality, there is still room for improvement. In this paper, motivated by the structure of Vision Outlooker from computer vision, we adopt a similar idea and propose an effective and lightweight neural vocoder called WOLONet. In this network, we develop a novel lightweight block that uses a location-variable, channel-independent, and depthwise dynamic convolutional kernel with sinusoidally activated dynamic kernel weights. To demonstrate the effectiveness and generalizability of our method, we perform an ablation study to verify our novel design and make subjective and objective comparisons with typical GAN-based vocoders. The results show that WOLONet achieves the best generation quality while requiring fewer parameters than the two SOTA neural vocoders, HiFiGAN and UnivNet.
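
To make the block described above more concrete, here is a minimal sketch of one possible reading of a location-variable, depthwise dynamic convolution with sinusoidally activated kernel weights. It is not the authors' implementation. The function name `dynamic_depthwise_conv`, the projection `proj`, the list-based shapes, and the interpretation of "channel-independent" as "one kernel per time step, shared by all channels" are assumptions made purely for illustration.

```python
import math

def dynamic_depthwise_conv(x, proj, kernel_size=3):
    """Toy location-variable depthwise convolution (illustrative only).

    x:    list of channels, each a list of `time` floats.
    proj: hypothetical projection, `kernel_size` rows of `channels` weights,
          used to predict one kernel per time step from that step's features.
    """
    channels, time = len(x), len(x[0])
    pad = kernel_size // 2  # assumes an odd kernel_size with zero padding

    # Location-variable: predict a separate kernel for every time step t,
    # then pass the predicted weights through a sine activation.
    kernels = [
        [math.sin(sum(proj[k][c] * x[c][t] for c in range(channels)))
         for t in range(time)]
        for k in range(kernel_size)
    ]

    out = [[0.0] * time for _ in range(channels)]
    for c in range(channels):          # depthwise: channels are never mixed
        for t in range(time):
            for k in range(kernel_size):
                src = t + k - pad
                if 0 <= src < time:    # zero padding outside the signal
                    # Assumed reading of "channel-independent": the kernel
                    # at step t is shared by every channel c.
                    out[c][t] += kernels[k][t] * x[c][src]
    return out
```

In this sketch, an all-zero `proj` yields an all-zero output because every predicted weight is sin(0) = 0. A real vocoder block would learn `proj` and would typically run batched on tensors rather than nested lists.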
