Traditional speech enhancement systems produce speech with compromised quality. Here we propose to use the high-quality speech generation capability of neural vocoders for better-quality speech enhancement. We term this approach parametric resynthesis (PR). In previous work, we showed that PR systems generate high-quality speech for a single speaker using two neural vocoders, WaveNet and WaveGlow. Both of these vocoders are traditionally speaker dependent. Here we first show that, when trained on data from enough speakers, these vocoders can generate speech from unseen speakers, both male and female, with quality similar to that of speakers seen in training. Next, using these two vocoders and a new vocoder, LPCNet, we evaluate the noise reduction quality of PR on unseen speakers and show that its objective signal and overall quality are higher than those of the state-of-the-art speech enhancement systems Wave-U-Net, Wavenet-denoise, and SEGAN. Moreover, in subjective quality, multiple-speaker PR outperforms the oracle Wiener mask.