Perceptual evaluation of speech quality (PESQ) requires a clean speech reference as input, yet it predicts the results of (reference-free) absolute category rating (ACR) listening tests. In this work, we train a fully convolutional recurrent neural network (FCRN) as a deep noise suppression (DNS) model, with either a non-intrusive or an intrusive PESQNet, where only the latter has access to a clean speech reference. The PESQNet serves as a mediator that provides a perceptual loss during DNS training, with the aim of maximizing the PESQ score of the enhanced speech signal. For the intrusive PESQNet, we investigate two topologies, called early-fusion (EF) and middle-fusion (MF) PESQNet, and compare them to the non-intrusive PESQNet to quantify the benefit of employing a clean speech reference input during DNS training. Detailed analyses show that the DNS trained with the MF-intrusive PESQNet outperforms the Interspeech 2021 DNS Challenge baseline and the same DNS trained with an MSE loss by 0.23 and 0.12 PESQ points, respectively. Furthermore, we show that this yields only marginal benefits over the DNS trained with the non-intrusive PESQNet. Therefore, like ACR listening tests, the PESQNet does not necessarily require a clean speech reference input, which opens the possibility of using real data for DNS training.
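To make the three PESQNet variants concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes that "early fusion" means combining the enhanced and reference signals before a single encoder, and that "middle fusion" means encoding each signal separately and combining the latent features; the abstract does not spell out either topology. All names (`toy_encoder`, `toy_head`, `perceptual_loss`, `target_score`) are hypothetical, and the squared-distance loss is only one plausible way to push a predicted score toward a target.

```python
from typing import Callable, List

Vector = List[float]
Encoder = Callable[[Vector], Vector]
Head = Callable[[Vector], float]


def non_intrusive(enhanced: Vector, encoder: Encoder, head: Head) -> float:
    # No clean reference: the score is predicted from the enhanced signal alone.
    return head(encoder(enhanced))


def early_fusion(enhanced: Vector, reference: Vector,
                 encoder: Encoder, head: Head) -> float:
    # Assumed reading of "early fusion": join the inputs, then encode once.
    return head(encoder(enhanced + reference))


def middle_fusion(enhanced: Vector, reference: Vector,
                  encoder: Encoder, head: Head) -> float:
    # Assumed reading of "middle fusion": encode separately, join the features.
    return head(encoder(enhanced) + encoder(reference))


def perceptual_loss(predicted_score: float, target_score: float) -> float:
    # Hypothetical loss: squared distance to a target quality score.
    return (target_score - predicted_score) ** 2


def toy_encoder(x: Vector) -> Vector:
    # Stand-in for a learned feature extractor: mean and maximum of the input.
    return [sum(x) / len(x), max(x)]


def toy_head(features: Vector) -> float:
    # Stand-in for a learned regression head: average of the features.
    return sum(features) / len(features)


enhanced = [1.0, 3.0]
reference = [2.0, 4.0]

score_ni = non_intrusive(enhanced, toy_encoder, toy_head)
score_ef = early_fusion(enhanced, reference, toy_encoder, toy_head)
score_mf = middle_fusion(enhanced, reference, toy_encoder, toy_head)
# target_score=4.5 is an arbitrary illustrative value, not taken from the paper.
loss_mf = perceptual_loss(score_mf, target_score=4.5)
```

In an actual training setup, the encoder and head would be trained networks and the loss would be backpropagated through the (fixed) PESQNet into the DNS model; the toy functions here only illustrate where the reference signal enters each variant.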