Single-channel audio separation aims to recover individual sources from a single-channel mixture. Most existing methods rely on supervised learning with synthetically generated paired data, yet obtaining high-quality paired data in real-world scenarios is often difficult. This data scarcity can degrade performance under unseen conditions and limit generalization. In this work, we instead approach the problem from an unsupervised perspective, framing it as a probabilistic inverse problem. Our method requires only diffusion priors trained on individual sources; separation is then achieved by iteratively guiding an initial state toward the solution through reconstruction guidance. Importantly, we introduce an inverse-problem solver designed specifically for separation, which mitigates gradient conflicts caused by interference between the diffusion prior and the reconstruction guidance during inverse denoising, ensuring high-quality and balanced separation across individual sources. We further find that initializing the denoising process with an augmented mixture, rather than pure Gaussian noise, provides an informative starting point that significantly improves final performance. To strengthen audio prior modeling, we also design a novel time-frequency attention-based network architecture with strong audio modeling capability. Together, these improvements yield significant performance gains, as validated on speech-sound event, sound event, and speech separation tasks.
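To make the core idea concrete, here is a minimal, hypothetical sketch of separation via reconstruction guidance. Everything in it is an illustrative assumption rather than the paper's actual solver: `toy_score` stands in for a learned diffusion prior over individual sources, the noise schedule and guidance weight are arbitrary toy choices, and the mixture-based initialization only mimics the "augmented mixture" starting point described above.

```python
import numpy as np

def toy_score(x, sigma):
    # Stand-in for a learned score model: gradient of the log-density of a
    # standard Gaussian prior smoothed at noise level sigma (toy assumption).
    return -x / (1.0 + sigma ** 2)

def separate(mixture, n_sources=2, steps=50, guidance_weight=10.0, seed=0):
    rng = np.random.default_rng(seed)
    sigmas = np.linspace(1.0, 0.01, steps)  # toy decreasing noise schedule
    # Informative initialization: start each source estimate from the
    # noise-augmented mixture rather than pure Gaussian noise.
    xs = [mixture / n_sources + sigmas[0] * rng.standard_normal(mixture.shape)
          for _ in range(n_sources)]
    for i, sigma in enumerate(sigmas):
        dt = sigma - (sigmas[i + 1] if i + 1 < steps else 0.0)
        # Reconstruction guidance: gradient of -||mixture - sum(xs)||^2 / 2
        # with respect to each source pulls the estimates so their sum
        # matches the observed mixture.
        residual = mixture - sum(xs)
        for k in range(n_sources):
            xs[k] = xs[k] + dt * sigma * (toy_score(xs[k], sigma)
                                          + guidance_weight * residual)
    return xs

mix = np.array([1.0, -0.5, 0.25, 0.0])
sources = separate(mix)
```

Note how the prior term (`toy_score`) and the guidance term act on each source in the same update: when the two gradients point in conflicting directions, a naive weighted sum like this can bias one source over another, which is the kind of interference the proposed solver is designed to mitigate.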