Neural vocoders have recently been widely used in speech synthesis tasks, including text-to-speech and voice conversion. However, when the data distribution at inference mismatches that of training, neural vocoders trained on real data often degrade in voice quality on unseen scenarios. In this paper, we train three commonly used neural vocoders, namely WaveNet, WaveRNN, and WaveGlow, separately on five different datasets. To study the robustness of neural vocoders, we evaluate the models using acoustic features from seen/unseen speakers, seen/unseen languages, a text-to-speech model, and a voice conversion model. We find that WaveNet is more robust than WaveRNN, especially when the training and testing data are inconsistent. Our experiments show that WaveNet is more suitable for text-to-speech models, while WaveRNN is more suitable for voice conversion applications. Furthermore, we present subjective human evaluation results of considerable reference value for future studies.