Recent text-to-speech (TTS) systems have achieved quality comparable to that of humans; however, their application to spoken dialogue has not been widely studied. This study aims to realize a TTS system that closely resembles human dialogue. First, we record and transcribe actual spontaneous dialogues. Then, the proposed dialogue TTS is trained in two stages. In the first stage, variational autoencoder (VAE)-VITS or Gaussian mixture variational autoencoder (GMVAE)-VITS is trained, which introduces an utterance-level latent variable into variational inference with adversarial learning for end-to-end text-to-speech (VITS), a recently proposed end-to-end TTS model. A style encoder that extracts a latent speaking-style representation from speech is trained jointly with the TTS model. In the second stage, a style predictor is trained to predict the speaking style to be synthesized from the dialogue history. During inference, by passing the speaking-style representation predicted by the style predictor to VAE/GMVAE-VITS, speech can be synthesized in a style appropriate to the context of the dialogue. Subjective evaluation results demonstrate that the proposed method outperforms the original VITS in terms of dialogue-level naturalness.
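The two-stage inference flow described above can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: all class and function names (`StylePredictor`, `VAEVITSSynthesizer`, `dialogue_tts`) are hypothetical stand-ins, and the style vector here is a dummy placeholder for the latent representation a trained encoder/predictor would produce.

```python
# Hypothetical sketch of the two-stage dialogue-TTS inference pipeline:
# a style predictor maps dialogue history to an utterance-level style
# vector, which then conditions the VAE/GMVAE-VITS synthesizer.
# All names and the stub logic are illustrative assumptions.

from typing import List


class StylePredictor:
    """Predicts an utterance-level speaking-style vector from dialogue history (stub)."""

    def predict(self, history: List[str]) -> List[float]:
        # A real model would encode the history with a neural network;
        # here we return a fixed-size dummy style vector.
        total_chars = sum(len(u) for u in history)
        return [total_chars % 7 / 7.0, len(history) / 10.0]


class VAEVITSSynthesizer:
    """End-to-end TTS conditioned on a latent style vector (stub)."""

    def synthesize(self, text: str, style: List[float]) -> dict:
        # A real model would return a waveform; we return metadata only.
        return {"text": text, "style": style}


def dialogue_tts(history: List[str], next_text: str) -> dict:
    predictor = StylePredictor()
    tts = VAEVITSSynthesizer()
    style = predictor.predict(history)       # stage 2: style from dialogue context
    return tts.synthesize(next_text, style)  # stage 1 model: style-conditioned synthesis
```

In this arrangement the style predictor can be retrained or swapped without touching the synthesizer, mirroring the paper's separation of the two training stages.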