Unified architectures in multimodal large language models (MLLMs) have shown promise in handling diverse tasks within a single framework. For the text-to-speech (TTS) task, current MLLM-based approaches rely on discrete token representations, which disregard the inherently continuous nature of speech and can lead to loss of fine-grained acoustic information. In this work, we investigate TTS within the MLLM paradigm using continuous speech representations. We design a dual-head architecture and adopt two complementary training strategies to obtain a robust model. (1) A diffusion head that generates continuous speech representations is added on top of the MLLM; it operates at the frame level and is strictly autoregressive. (2) The original language model head is retained to preserve multitask capability and to control the start and end of speech synthesis. (3) Masked training is employed to mitigate exposure bias in autoregressive decoding. (4) To stabilize optimization, we propose a two-stage training scheme in which the LM is frozen in the second stage, ensuring that the diffusion head learns from a fixed input distribution. Evaluations on the LibriSpeech(PC) test-clean set show that our approach achieves state-of-the-art autoregressive performance, with a WER of 1.95%, speaker similarity of 0.54, and UTMOS of 4.00. The two-stage training yields a 46% relative WER reduction over the one-stage baseline. These results highlight the effectiveness of combining autoregressive modeling with continuous-token diffusion, supported by a two-stage training procedure.
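To make the dual-head design concrete, the following is a minimal sketch, not the authors' implementation: a shared autoregressive backbone whose per-step hidden state feeds (a) an LM head predicting text and start/stop control tokens and (b) a diffusion-style head that denoises a continuous speech frame conditioned on that hidden state. All module names, dimensions, and the simplified noise-prediction head are illustrative assumptions.

```python
# Hypothetical sketch of a dual-head MLLM for continuous-representation TTS.
# Shapes and modules are assumptions for illustration, not the paper's code.
import torch
import torch.nn as nn

class DualHeadTTS(nn.Module):
    def __init__(self, vocab_size=32000, d_model=1024, n_layers=12,
                 n_heads=16, frame_dim=80, n_diffusion_steps=1000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model,
                                           batch_first=True)
        # Shared backbone; a causal mask in forward() keeps it strictly autoregressive.
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        # (2) Original LM head: multitask text prediction and start/end control.
        self.lm_head = nn.Linear(d_model, vocab_size)
        # (1) Diffusion head: predicts the noise added to a continuous frame,
        # conditioned on the backbone hidden state and a diffusion timestep.
        self.t_embed = nn.Embedding(n_diffusion_steps, d_model)
        self.diff_head = nn.Sequential(
            nn.Linear(d_model + frame_dim, d_model), nn.SiLU(),
            nn.Linear(d_model, frame_dim),
        )

    def forward(self, tokens, noisy_frames, t):
        # tokens: (B, T) ids; noisy_frames: (B, T, frame_dim); t: (B, T) timesteps.
        T = tokens.size(1)
        causal = nn.Transformer.generate_square_subsequent_mask(T).to(tokens.device)
        h = self.backbone(self.embed(tokens), mask=causal)
        logits = self.lm_head(h)                 # trained with cross-entropy
        cond = h + self.t_embed(t)               # inject diffusion timestep
        eps_hat = self.diff_head(torch.cat([cond, noisy_frames], dim=-1))
        return logits, eps_hat                   # eps_hat trained with MSE to the true noise
```

Under the two-stage scheme described above, stage two would freeze `self.backbone`, `self.embed`, and `self.lm_head` and update only the diffusion head, so it learns from a fixed distribution of hidden states.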