Speech separation and enhancement (SSE) has advanced remarkably and achieved promising results in controlled settings, such as a fixed number of speakers and a fixed array configuration. Towards a universal SSE system, single-channel systems have been extended to handle a variable number of speakers (i.e., outputs). Meanwhile, multi-channel systems that accommodate various array configurations (i.e., inputs) have been developed. However, these two lines of work have been pursued separately. In this paper, we propose FlexIO, an SSE system that is flexible in both its inputs and its outputs. It performs conditional separation using one prompt vector per speaker as the condition, which allows it to separate an arbitrary number of speakers. Multi-channel mixtures are processed together with the prompt vectors via an array-agnostic channel communication mechanism. Our experiments demonstrate that FlexIO successfully covers diverse conditions with one to five microphones and one to three speakers. We also confirm the robustness of FlexIO on CHiME-4 real data.
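As a rough illustration of the input/output flexibility described above, the following NumPy sketch shows how a single function can accept any number of channels and return one output per prompt vector. It is not the FlexIO architecture: the function names `channel_communicate` and `separate`, the projection `w`, and the use of mean pooling across channels as the array-agnostic step are all assumptions made for this toy example.

```python
import numpy as np

def channel_communicate(x):
    """Share information across channels via a mean over the channel axis.

    x: (channels, frames, dim). Mean pooling works for any channel count
    and ignores channel order. Assumed stand-in, not FlexIO's mechanism.
    """
    pooled = x.mean(axis=0, keepdims=True)           # (1, frames, dim)
    pooled = np.broadcast_to(pooled, x.shape)        # (channels, frames, dim)
    return np.concatenate([x, pooled], axis=-1)      # (channels, frames, 2*dim)

def separate(features, prompts, w):
    """Toy conditional separation: one mask per prompt vector.

    features: (channels, frames, dim) multi-channel mixture features
    prompts:  (speakers, dim), one hypothetical prompt vector per speaker
    w:        (2*dim, dim) toy projection (random here, learned in practice)
    Returns masks of shape (speakers, frames, dim) with values in [0, 1].
    """
    h = channel_communicate(features) @ w            # (channels, frames, dim)
    h = h.mean(axis=0)                               # (frames, dim)
    logits = h[None, :, :] * prompts[:, None, :]     # (speakers, frames, dim)
    return 1.0 / (1.0 + np.exp(-logits))             # sigmoid mask

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames, dim = 4, 8
    w = rng.standard_normal((2 * dim, dim))
    for channels in range(1, 6):                     # one to five microphones
        for speakers in range(1, 4):                 # one to three speakers
            x = rng.standard_normal((channels, frames, dim))
            p = rng.standard_normal((speakers, dim))
            print(channels, speakers, separate(x, p, w).shape)
```

The point of the sketch is only the shape behavior: the output count follows the number of prompt vectors, and the result does not depend on how many channels are supplied or in which order.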