Title: On Data Sampling Strategies for Training Neural Network Speech Separation Models

William Ravenscroft,Stefan Goetze,Thomas Hain

From arXiv. Submitted to EUSIPCO 2023.

Speech separation remains an important area of multi-speaker signal processing. Deep neural network (DNN) models have attained the best performance on many speech separation benchmarks. Some of these models can take significant time to train and have high memory requirements. Previous work has proposed shortening training examples to address these issues, but the impact of doing so on model performance is not yet well understood. In this work, the impact of applying these training signal length (TSL) limits is analysed for two speech separation models: SepFormer, a transformer model, and Conv-TasNet, a convolutional model. The WSJ0-2Mix, WHAMR and Libri2Mix datasets are analysed in terms of their signal length distributions and the impact of those distributions on training efficiency. It is demonstrated that, for specific distributions, applying specific TSL limits results in better performance. This is shown to be mainly because randomly sampling the start index of the waveforms yields more unique training examples. A SepFormer model trained using a TSL limit of 4.42 s and dynamic mixing (DM) is shown to match the best-performing SepFormer model trained with DM and unlimited signal lengths. Furthermore, the 4.42 s TSL limit results in a 44% reduction in training time with WHAMR.
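
To make the TSL idea concrete, the following is a minimal sketch of cropping a training example to a TSL limit with a randomly sampled start index. It is an illustration under stated assumptions, not the authors' implementation: the function name `crop_to_tsl`, its parameters, and the 8 kHz sample rate are hypothetical choices that the abstract does not specify.

```python
import random


def crop_to_tsl(mixture, sources, tsl_seconds=4.42, sample_rate=8000, rng=None):
    """Crop a mixture and its source signals to at most `tsl_seconds`.

    `mixture` is a sequence of samples and `sources` is a list of sequences
    of the same length. The 8 kHz default sample rate is an assumption made
    for this sketch only.
    """
    if rng is None:
        rng = random.Random()
    max_len = int(round(tsl_seconds * sample_rate))
    n = len(mixture)
    if n <= max_len:
        # Signals already within the limit are passed through unchanged.
        return mixture, sources
    # Sample the start index uniformly; randint bounds are inclusive.
    # Drawing a new start each epoch exposes different segments of the
    # same utterance, i.e. more unique training examples.
    start = rng.randint(0, n - max_len)
    end = start + max_len
    # The same window is applied to the mixture and to every source so the
    # cropped targets stay time-aligned with the cropped input.
    return mixture[start:end], [s[start:end] for s in sources]
```

Called once per example each time a batch is built, a function like this returns a different segment of any long utterance on each pass, while utterances shorter than the limit are left untouched.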
