An important problem in speech separation with ad-hoc microphone arrays is how to guarantee a system's robustness to the locations and number of microphones. The former requires the system to be invariant to the order in which microphones at the same locations are indexed, while the latter requires it to process inputs with a varying number of channels. Conventional optimization-based beamforming techniques satisfy these requirements by definition, whereas deep learning-based end-to-end systems do not yet fully address them. In this paper, we propose transform-average-concatenate (TAC), a simple design paradigm for multi-channel speech separation that is invariant to channel permutation and to the number of channels. Building on the filter-and-sum network (FaSNet), a recently proposed end-to-end time-domain beamforming system, we show that TAC significantly improves separation performance across various numbers of microphones in noisy reverberant separation tasks with ad-hoc arrays. Moreover, we show that TAC also significantly improves separation performance with a fixed-geometry array configuration, further demonstrating the effectiveness of the proposed paradigm for the general problem of multi-microphone speech separation.
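To make the two invariance requirements concrete, the following is a minimal NumPy sketch of one plausible reading of the transform-average-concatenate idea, inferred from its name alone. The function name `tac_sketch`, the weight matrices `w_t`, `w_a`, `w_c`, the ReLU activations, the residual connection, and all layer sizes are assumptions made for illustration; this is not the authors' implementation.

```python
import numpy as np


def tac_sketch(x, w_t, w_a, w_c):
    """Illustrative transform-average-concatenate step (hypothetical layout).

    x   : (channels, features) array, one feature vector per microphone
    w_t : (features, hidden) weights shared by every channel
    w_a : (hidden, hidden) weights applied to the cross-channel mean
    w_c : (2 * hidden, features) weights applied after concatenation
    """
    # Transform: the same weights are applied to every channel.
    h = np.maximum(x @ w_t, 0.0)                    # (channels, hidden)
    # Average: the mean over channels does not depend on channel order
    # and is defined for any number of channels.
    m = np.maximum(h.mean(axis=0) @ w_a, 0.0)       # (hidden,)
    # Concatenate: attach the shared summary to each channel's features.
    m_tiled = np.broadcast_to(m, h.shape)           # (channels, hidden)
    cat = np.concatenate([h, m_tiled], axis=1)      # (channels, 2 * hidden)
    # Residual connection (an assumption) keeps the output shape equal to x.
    return x + np.maximum(cat @ w_c, 0.0)


# Hypothetical sizes, chosen only for the demonstration.
rng = np.random.default_rng(0)
n_feat, n_hid = 8, 16
w_t = rng.standard_normal((n_feat, n_hid))
w_a = rng.standard_normal((n_hid, n_hid))
w_c = rng.standard_normal((2 * n_hid, n_feat))

x4 = rng.standard_normal((4, n_feat))   # four microphones
x6 = rng.standard_normal((6, n_feat))   # six microphones, same weights
y4 = tac_sketch(x4, w_t, w_a, w_c)
y6 = tac_sketch(x6, w_t, w_a, w_c)
```

Because every channel shares the same weights and channels interact only through a mean, reordering the rows of `x` reorders the rows of the output in the same way, and the same weights accept any number of rows.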