While neural text-to-speech (TTS) has achieved human-like natural synthetic speech, multilingual TTS systems are limited to resource-rich languages due to the need for paired text and studio-quality audio data. This paper proposes a method for zero-shot multilingual TTS that uses only text data for the target language. The use of text-only data allows the development of TTS systems for low-resource languages for which only textual resources are available, making TTS accessible to thousands of languages. Inspired by the strong cross-lingual transferability of multilingual language models, our framework first performs masked language model pretraining with multilingual text-only data. We then train this model on paired data in a supervised manner while freezing a language-aware embedding layer. This allows inference even for languages not included in the paired data but present in the text-only data. Evaluation results demonstrate highly intelligible zero-shot TTS with a character error rate of less than 12% for an unseen language. All experiments were conducted using public datasets, and the implementation will be made available for reproducibility.
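To make the two-stage recipe described above concrete, the following is a minimal illustrative sketch in PyTorch, not the authors' implementation: a hypothetical `LanguageAwareEncoder` with a language-aware embedding table is first pretrained with a (simplified, unmasked) MLM objective on multilingual text, and that embedding layer is then frozen before supervised training on paired data, so languages seen only as text remain usable at inference time. All class, variable, and hyperparameter names here are placeholders.

```python
# Illustrative sketch only (assumed architecture, not the paper's code).
import torch
import torch.nn as nn

class LanguageAwareEncoder(nn.Module):
    """Toy text encoder with a language-aware embedding layer (hypothetical)."""
    def __init__(self, vocab_size, n_langs, d_model=256):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.lang_emb = nn.Embedding(n_langs, d_model)   # language-aware embedding layer
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.mlm_head = nn.Linear(d_model, vocab_size)   # used only in stage 1

    def forward(self, tokens, lang_ids):
        # Add the language embedding to every token position.
        x = self.token_emb(tokens) + self.lang_emb(lang_ids).unsqueeze(1)
        return self.encoder(x)

# Stage 1: masked language model pretraining on multilingual text-only data
# (token masking omitted here for brevity; a real MLM predicts masked positions).
model = LanguageAwareEncoder(vocab_size=500, n_langs=8)
optim = torch.optim.Adam(model.parameters(), lr=1e-4)
tokens = torch.randint(0, 500, (2, 32))   # placeholder text batch
lang_ids = torch.tensor([0, 5])           # includes languages that have text only
logits = model.mlm_head(model(tokens, lang_ids))
loss = nn.functional.cross_entropy(logits.view(-1, 500), tokens.view(-1))
loss.backward()
optim.step()

# Stage 2: supervised TTS training on paired data with the language-aware
# embedding layer frozen, so embeddings of text-only languages are preserved
# and the model can synthesize them zero-shot at inference time.
for p in model.lang_emb.parameters():
    p.requires_grad = False
optim = torch.optim.Adam([p for p in model.parameters() if p.requires_grad], lr=1e-4)
```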