Recent studies on end-to-end (E2E) speech generation with large language models (LLMs) have attracted significant community attention, with multiple works extending text-based LLMs to generate discrete speech tokens. Existing E2E approaches primarily fall into two categories: (1) methods that generate discrete speech tokens independently, without incorporating them into the LLM's autoregressive process, so that text generation remains unaware of concurrent speech synthesis; and (2) models that generate interleaved or parallel speech-text tokens through joint autoregressive modeling, enabling mutual modality awareness during generation. This paper presents DrVoice, a parallel speech-text voice conversation model based on joint autoregressive modeling, featuring dual-resolution speech representations. Notably, while current methods mainly use a 12.5 Hz input audio representation, our proposed dual-resolution mechanism reduces the input frequency for the LLM to 5 Hz, which significantly reduces computational cost and alleviates the frequency discrepancy between speech and text tokens, in turn better exploiting the LLM's capabilities. Experimental results demonstrate that DrVoice-7B establishes a new state of the art (SOTA) on the OpenAudioBench and Big Bench Audio benchmarks, while achieving performance comparable to the SOTA on the VoiceBench and UltraEval-Audio benchmarks, making it a leading open-source speech foundation model among ~7B-parameter models.
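The abstract does not specify how the 12.5 Hz-to-5 Hz reduction is realized. Below is a minimal sketch of one plausible implementation, assuming the dual-resolution mechanism groups consecutive speech frames and linearly projects them before they enter the LLM; the class name `DualResolutionDownsampler` and the 5-frames-in, 2-frames-out grouping (a 2.5x compression matching the 12.5 Hz to 5 Hz ratio) are illustrative assumptions, not the paper's actual design.

```python
import torch
import torch.nn as nn


class DualResolutionDownsampler(nn.Module):
    """Hypothetical sketch: compress a 12.5 Hz speech embedding sequence
    to 5 Hz by stacking groups of frames and projecting back to the
    model dimension. 12.5 Hz -> 5 Hz is a 2.5x reduction, so every
    5 input frames map to 2 output frames."""

    def __init__(self, dim: int, group_in: int = 5, group_out: int = 2):
        super().__init__()
        self.group_in = group_in
        self.proj = nn.Linear(group_in * dim, group_out * dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim) at 12.5 Hz; pad time to a multiple of
        # group_in (5 frames = 0.4 s of audio).
        b, t, d = x.shape
        pad = (-t) % self.group_in
        if pad:
            x = nn.functional.pad(x, (0, 0, 0, pad))
        x = x.reshape(b, -1, self.group_in * d)  # (b, t/5, 5*d)
        x = self.proj(x)                         # (b, t/5, 2*d)
        return x.reshape(b, -1, d)               # (b, 2t/5, d) at 5 Hz


# Example: 2 s of audio -> 25 frames at 12.5 Hz -> 10 frames at 5 Hz.
feats = torch.randn(1, 25, 1024)
down = DualResolutionDownsampler(dim=1024)
print(down(feats).shape)  # torch.Size([1, 10, 1024])
```

Under this assumption, the LLM attends over 2.5x fewer speech positions per second, which is the source of the computational savings and of the closer match to typical text-token rates that the abstract claims.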