State-of-the-art deep-learning-based voice activity detectors (VADs) are often trained on anechoic data. However, real acoustic environments are generally reverberant, which causes performance to deteriorate significantly. To mitigate this mismatch between training data and real data, we simulate an augmented training set that contains nearly five million utterances. This extension comprises anechoic utterances and their reverberant modifications, generated by convolving the anechoic utterances with a variety of room impulse responses (RIRs). We consider five different models for generating RIRs, and five different VADs trained on the augmented training set. We test all trained systems in three different real reverberant environments. Experimental results show a $20\%$ average increase in accuracy, precision, and recall for all detectors and RIR models, compared to anechoic training. Furthermore, one of the RIR models consistently yields better performance than the other models for all tested VADs. Additionally, one of the VADs consistently outperforms the other VADs in all experiments.
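The core augmentation step described above is a convolution of each anechoic utterance with an RIR. A minimal sketch of this step, assuming NumPy and a synthetic exponentially decaying RIR in place of the paper's measured or modeled responses (the function and variable names here are illustrative, not from the paper):

```python
import numpy as np

def reverberate(anechoic: np.ndarray, rir: np.ndarray) -> np.ndarray:
    """Simulate a reverberant utterance by convolving a dry signal with an RIR.

    The result is rescaled so its peak matches the anechoic peak, keeping
    dry and wet copies at comparable levels in the augmented training set.
    """
    wet = np.convolve(anechoic, rir)  # full convolution: len(a) + len(rir) - 1
    peak = np.max(np.abs(wet))
    if peak > 0:
        wet = wet * (np.max(np.abs(anechoic)) / peak)
    return wet

# Toy example: 1 s of noise as a stand-in for speech, plus a 0.3 s
# exponentially decaying synthetic RIR (a crude room model for illustration).
fs = 16000
rng = np.random.default_rng(0)
dry = rng.standard_normal(fs)
t = np.arange(int(0.3 * fs)) / fs
rir = np.exp(-t / 0.05) * rng.standard_normal(t.size)
wet = reverberate(dry, rir)
```

In practice one would draw RIRs from the chosen room model (varying room geometry, absorption, and source-microphone positions) and convolve each clean utterance with several of them to build the augmented set.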