We present a multi-speaker Japanese audiobook text-to-speech (TTS) system that leverages multimodal context information, namely the preceding acoustic context and the bilateral textual context, to improve the prosody of synthetic speech. Previous work uses either unilateral or single-modality context, which does not fully capture the available context information. The proposed method uses an acoustic context encoder and a textual context encoder to aggregate context information and feeds it to the TTS model, enabling the model to predict context-dependent prosody. We conducted comprehensive objective and subjective evaluations on a multi-speaker Japanese audiobook dataset. Experimental results demonstrate that the proposed method significantly outperforms two previous works. Additionally, we present insights about the different choices of context (modality, laterality, and length) for audiobook TTS that have not been discussed in the literature before.
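For concreteness, the context aggregation described above can be sketched as follows. This is a minimal illustration, assuming a GRU-based encoder over mel-spectrogram frames for the preceding speech and pre-computed sentence embeddings (e.g., BERT-style, 768-dimensional) for the preceding and following text; all module names, dimensions, and the concatenation-based fusion are illustrative assumptions, not the exact design of the proposed system.

import torch
import torch.nn as nn

class AcousticContextEncoder(nn.Module):
    """Summarizes the mel-spectrogram of the preceding utterance into one vector."""
    def __init__(self, n_mels=80, hidden=256, out_dim=128):
        super().__init__()
        self.rnn = nn.GRU(n_mels, hidden, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, out_dim)

    def forward(self, mel):  # mel: (batch, frames, n_mels)
        _, h = self.rnn(mel)  # h: (2, batch, hidden), final states of both directions
        return self.proj(torch.cat([h[0], h[1]], dim=-1))  # (batch, out_dim)

class TextualContextEncoder(nn.Module):
    """Fuses pre-computed sentence embeddings of the preceding and following text."""
    def __init__(self, emb_dim=768, out_dim=128):
        super().__init__()
        self.proj = nn.Linear(2 * emb_dim, out_dim)

    def forward(self, prev_emb, next_emb):  # each: (batch, emb_dim)
        return self.proj(torch.cat([prev_emb, next_emb], dim=-1))  # (batch, out_dim)

# Toy usage: the fused context vector would then be broadcast over the phoneme
# sequence and added to the TTS encoder output as a conditioning signal.
batch = 2
ctx = torch.cat([
    AcousticContextEncoder()(torch.randn(batch, 400, 80)),                       # preceding speech
    TextualContextEncoder()(torch.randn(batch, 768), torch.randn(batch, 768)),   # bilateral text
], dim=-1)  # (batch, 256)

Broadcasting a single fixed-size vector over the phoneme sequence is one simple conditioning choice; finer-grained alternatives such as attention over context frames would follow the same encoder interface.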