Speech recognition with ad-hoc microphone arrays has recently received much attention. Channel selection is known to be an important problem for ad-hoc microphone arrays; however, it remains largely unexplored in speech recognition, particularly for large-scale ad-hoc microphone arrays. To address this problem, we propose a Scaling Sparsemax algorithm for channel selection in speech recognition with large-scale ad-hoc microphone arrays. Specifically, we first replace the conventional Softmax operator in the stream attention mechanism of a multichannel end-to-end speech recognition system with Sparsemax, which conducts channel selection by forcing the weights of noisy channels to zero. Because Sparsemax harshly forces the weights of many channels to zero, we further propose Scaling Sparsemax, which penalizes channels more mildly by zeroing the weights of only the very noisy channels. Experimental results with ad-hoc microphone arrays of over 30 channels under the Conformer speech recognition architecture show that the proposed Scaling Sparsemax yields a word error rate more than 30% lower than Softmax on simulated data sets and more than 20% lower on semi-real data sets, in test scenarios with both matched and mismatched numbers of channels.
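To illustrate how Sparsemax produces exactly-zero channel weights (unlike Softmax, whose outputs are always strictly positive), here is a minimal NumPy sketch of the standard Sparsemax projection of Martins and Astudillo (2016). This is the plain Sparsemax that the abstract says zeros out too many channels; the paper's Scaling Sparsemax variant, whose exact formulation is not given here, relaxes this behavior so that only very noisy channels are zeroed.

```python
import numpy as np

def sparsemax(z):
    """Euclidean projection of the logit vector z onto the probability
    simplex. Unlike Softmax, many output weights become exactly zero,
    which is how Sparsemax performs hard channel selection."""
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]            # logits in descending order
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, len(z) + 1)
    support = 1 + k * z_sorted > cumsum    # channels kept in the support
    k_z = k[support][-1]                   # size of the support set
    tau = (cumsum[support][-1] - 1.0) / k_z  # soft-thresholding constant
    return np.maximum(z - tau, 0.0)        # zero weight for pruned channels
```

For example, with logits `[3.0, 1.0, 0.1]` Sparsemax assigns all the mass to the first channel and exactly zero to the other two, whereas Softmax would spread small positive weights over all three; this all-or-nothing pruning is the harshness that motivates the Scaling variant.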