Recent studies on end-to-end (E2E) speech generation with large language models (LLMs) have attracted significant community attention, with multiple works extending text-based LLMs to generate discrete speech tokens. Existing E2E approaches primarily fall into two categories: (1) methods that generate discrete speech tokens independently, without incorporating them into the LLM's autoregressive process, so that text generation remains unaware of concurrent speech synthesis; and (2) models that generate interleaved or parallel speech-text tokens through joint autoregressive modeling, enabling mutual modality awareness during generation. This paper presents DrVoice, a parallel speech-text voice conversation model based on joint autoregressive modeling, featuring dual-resolution speech representations. Notably, while current methods mainly use a 12.5 Hz input audio representation, our proposed dual-resolution mechanism reduces the input frequency for the LLM to 5 Hz, which significantly reduces computational cost and alleviates the frequency discrepancy between speech and text tokens, in turn better exploiting the LLM's capabilities. Experimental results demonstrate that DrVoice-7B establishes a new state of the art (SOTA) on the OpenAudioBench and Big Bench Audio benchmarks, while achieving performance comparable to the SOTA on the VoiceBench and UltraEval-Audio benchmarks, making it a leading open-source speech foundation model among ~7B-parameter models.
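The abstract does not specify how the 12.5 Hz-to-5 Hz reduction is realized. Below is a minimal sketch of one plausible implementation, assuming the dual-resolution mechanism groups consecutive speech frames and linearly projects them before they enter the LLM; the class name `DualResolutionDownsampler` and the 5-frames-in, 2-frames-out grouping (a 2.5x compression matching the 12.5 Hz to 5 Hz ratio) are illustrative assumptions, not the paper's actual design.

```python
import torch
import torch.nn as nn


class DualResolutionDownsampler(nn.Module):
    """Hypothetical sketch: compress a 12.5 Hz speech embedding sequence
    to 5 Hz by stacking groups of frames and projecting back to the
    model dimension. 12.5 Hz -> 5 Hz is a 2.5x reduction, so every
    5 input frames map to 2 output frames."""

    def __init__(self, dim: int, group_in: int = 5, group_out: int = 2):
        super().__init__()
        self.group_in = group_in
        self.proj = nn.Linear(group_in * dim, group_out * dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim) at 12.5 Hz; pad time to a multiple of
        # group_in (5 frames = 0.4 s of audio).
        b, t, d = x.shape
        pad = (-t) % self.group_in
        if pad:
            x = nn.functional.pad(x, (0, 0, 0, pad))
        x = x.reshape(b, -1, self.group_in * d)  # (b, t/5, 5*d)
        x = self.proj(x)                         # (b, t/5, 2*d)
        return x.reshape(b, -1, d)               # (b, 2t/5, d) at 5 Hz


# Example: 2 s of audio -> 25 frames at 12.5 Hz -> 10 frames at 5 Hz.
feats = torch.randn(1, 25, 1024)
down = DualResolutionDownsampler(dim=1024)
print(down(feats).shape)  # torch.Size([1, 10, 1024])
```

Under this assumption, the LLM attends over 2.5x fewer speech positions per second, which is the source of the computational savings and of the closer match to typical text-token rates that the abstract claims.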