Recent work has shown that deep recurrent neural networks using the LSTM architecture can achieve strong single-channel speech enhancement by estimating time-frequency masks. However, these models do not naturally generalize to multi-channel inputs from varying microphone configurations. In contrast, spatial clustering techniques can achieve such generalization but lack a strong signal model. Our work proposes a combination of the two approaches. By using LSTMs to enhance spatial-clustering-based time-frequency masks, we achieve both the signal modeling performance of multiple single-channel LSTM-DNN speech enhancers and the signal separation performance and generality of multi-channel spatial clustering. We compare our proposed system to several baselines on the CHiME-3 dataset. We evaluate the quality of the enhanced audio using SDR from the BSS\_eval toolkit and PESQ, and we evaluate intelligibility using the word error rate of a Kaldi automatic speech recognizer.
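As a rough illustration of the single-channel component described above, the following is a minimal sketch, not the paper's actual model, of an LSTM-based time-frequency mask estimator: a bidirectional LSTM maps noisy log-magnitude spectrogram frames to a per-bin sigmoid mask that is applied to the noisy STFT. The use of PyTorch, the class name `MaskEstimator`, and all layer sizes are illustrative assumptions.

```python
# Sketch of LSTM mask estimation for single-channel speech enhancement.
# All architectural choices here are assumptions for illustration only.
import torch
import torch.nn as nn

class MaskEstimator(nn.Module):
    def __init__(self, n_freq=513, hidden=256, layers=2):
        super().__init__()
        # Bidirectional LSTM over spectrogram frames.
        self.blstm = nn.LSTM(input_size=n_freq, hidden_size=hidden,
                             num_layers=layers, batch_first=True,
                             bidirectional=True)
        # Project LSTM outputs back to one value per frequency bin.
        self.proj = nn.Linear(2 * hidden, n_freq)

    def forward(self, log_mag):              # (batch, frames, n_freq)
        h, _ = self.blstm(log_mag)           # (batch, frames, 2*hidden)
        return torch.sigmoid(self.proj(h))   # mask in [0, 1] per T-F bin

# Toy usage: estimate a mask from the noisy spectrogram and apply it.
noisy_stft = torch.randn(1, 100, 513, dtype=torch.cfloat)  # placeholder STFT
model = MaskEstimator()
mask = model(torch.log1p(noisy_stft.abs()))
enhanced_stft = mask * noisy_stft            # masked (enhanced) STFT
```

In the proposed system such network-estimated masks are combined with masks from multi-channel spatial clustering rather than used in isolation.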