In this paper we propose a conditioned UNet for Music Source Separation (MSS). MSS is generally performed by multi-output neural networks, typically UNets, with each output representing a particular stem from a predefined instrument vocabulary. In contrast, conditioned MSS networks accept an audio query related to the stem of interest alongside the signal from which that stem is to be extracted. A strict vocabulary is therefore not required, which enables more realistic MSS tasks. The potential of conditioned approaches for such tasks has been somewhat hidden by a lack of suitable data, an issue recently addressed by the MoisesDb dataset. A recent method, Banquet, employs this dataset with promising results on larger vocabularies. Banquet uses a Bandsplit RNN rather than a UNet, and its authors argue that UNets are unsuitable for conditioned MSS. We counter this argument and propose QSCNet, a novel conditioned UNet for MSS that integrates network conditioning elements into the Sparse Compressed Network for MSS. We find that QSCNet outperforms Banquet by over 1 dB SNR on a couple of MSS tasks, while using less than half the number of parameters.
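To illustrate the query-conditioning idea described above, the following is a minimal sketch (not the paper's actual QSCNet architecture) of a UNet-style block whose feature maps are modulated by a query embedding via feature-wise linear modulation (FiLM); all class and variable names here are illustrative assumptions.

```python
# Minimal sketch of query conditioning: a convolutional block whose features
# are scaled and shifted per channel by parameters predicted from a query
# embedding. This is an illustrative assumption, not the paper's design.
import torch
import torch.nn as nn

class FiLMConvBlock(nn.Module):
    def __init__(self, in_ch, out_ch, query_dim):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.norm = nn.GroupNorm(1, out_ch)
        # Per-channel scale (gamma) and shift (beta) predicted from the query.
        self.to_gamma = nn.Linear(query_dim, out_ch)
        self.to_beta = nn.Linear(query_dim, out_ch)

    def forward(self, x, q):
        # x: (batch, in_ch, freq, time) spectrogram features of the mixture
        # q: (batch, query_dim) embedding of the audio query (e.g. a stem example)
        h = self.norm(self.conv(x))
        gamma = self.to_gamma(q).unsqueeze(-1).unsqueeze(-1)
        beta = self.to_beta(q).unsqueeze(-1).unsqueeze(-1)
        return torch.relu(gamma * h + beta)

# Usage: condition one block on a random query embedding.
block = FiLMConvBlock(in_ch=2, out_ch=32, query_dim=128)
mix = torch.randn(4, 2, 256, 128)   # batch of mixture spectrogram patches
query = torch.randn(4, 128)         # query embeddings from a separate encoder
out = block(mix, query)             # -> (4, 32, 256, 128)
```

In a conditioned UNet, such modulation would typically be applied at several encoder or decoder stages so that the same network can extract different stems depending on the supplied query.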