End-to-End Dereverberation, Beamforming, and Speech Recognition with Improved Numerical Stability and Advanced Frontend

Wangyou Zhang, Christoph Boeddeker, Shinji Watanabe, Tomohiro Nakatani, Marc Delcroix, Keisuke Kinoshita, Tsubasa Ochiai, Naoyuki Kamo, Reinhold Haeb-Umbach, Yanmin Qian

From arXiv; 5 pages, 1 figure; accepted by ICASSP 2021

Recently, the end-to-end approach has been successfully applied to multi-speaker speech separation and recognition in both single-channel and multichannel conditions. However, severe performance degradation is still observed in reverberant and noisy scenarios, and there is still a large performance gap between anechoic and reverberant conditions. In this work, we focus on the multichannel multi-speaker reverberant condition, and propose to extend our previous framework for end-to-end dereverberation, beamforming, and speech recognition with improved numerical stability and advanced frontend subnetworks, including voice-activity-detection-like masks. These techniques significantly stabilize the end-to-end training process. Experiments on the spatialized wsj1-2mix corpus show that the proposed system achieves about a 35% relative WER reduction compared to our conventional multichannel E2E ASR system, and also obtains decent speech dereverberation and separation performance (SDR = 12.5 dB) in the reverberant multi-speaker condition while trained only with the ASR criterion.
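As a reading aid for the two figures quoted in the abstract, the following is a minimal sketch of how a relative WER reduction and a simplified signal-to-distortion ratio can be computed. The function names and the plain energy-ratio form of SDR are assumptions made for illustration; the abstract does not state which SDR variant or evaluation toolkit the authors used, and the paper's evaluation may differ.

```python
import math


def relative_reduction(baseline_wer, new_wer):
    """Relative WER reduction, as a fraction of the baseline WER.

    Hypothetical helper: both arguments are word error rates in the same
    unit (for example, percent).
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


def simple_sdr_db(reference, estimate):
    """Simplified SDR in dB: 10 * log10(||s||^2 / ||s - s_hat||^2).

    Assumption: this is the plain energy-ratio definition, not the
    projection-based SDR reported by separation evaluation toolkits.
    """
    signal_energy = sum(r * r for r in reference)
    error_energy = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    if error_energy == 0:
        return math.inf  # perfect reconstruction
    return 10.0 * math.log10(signal_energy / error_energy)
```

For instance, with purely hypothetical numbers, a baseline WER of 20% lowered to 13% corresponds to `relative_reduction(20.0, 13.0)`, a relative reduction of 35%. These values are illustrative only and are not taken from the paper.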
