The design of acoustic features is important for speech separation. Acoustic features can be roughly categorized into three classes: handcrafted, parameterized, and learnable. Among them, learnable features, which are trained jointly with the separation network in an end-to-end fashion, have become a trend in modern speech separation research, e.g., the convolutional time-domain audio separation network (Conv-TasNet), while handcrafted and parameterized features have also been shown to be competitive in very recent studies. However, a systematic comparison across the three kinds of acoustic features has not yet been conducted. In this paper, we compare them within the Conv-TasNet framework by setting its encoder and decoder to different acoustic features. We also generalize the handcrafted multi-phase gammatone filterbank (MPGTF) to a new parameterized multi-phase gammatone filterbank (ParaMPGTF). Experimental results on the WSJ0-2mix corpus show that (i) if the decoder is learnable, then setting the encoder to the short-time Fourier transform (STFT), MPGTF, ParaMPGTF, or learnable features leads to similar performance; and (ii) when the pseudo-inverse transforms of STFT, MPGTF, and ParaMPGTF are used as the decoders, the proposed ParaMPGTF performs better than the two handcrafted features.
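As a rough illustration of what "setting the encoder to a fixed filterbank and the decoder to its pseudo-inverse" can mean, the following minimal NumPy sketch uses a real-valued STFT-like filterbank (stacked cosine and sine filters). It is not the authors' implementation: the helper names (`frame`, `overlap_add`, `dft_filterbank`, `encode`, `decode`), the window length, and the hop size are assumptions chosen for illustration, and the separation network, masking, and any encoder nonlinearity are omitted.

```python
import numpy as np

def frame(signal, win, hop):
    """Split a 1-D signal into overlapping frames of length `win`."""
    n_frames = 1 + (len(signal) - win) // hop
    idx = np.arange(win)[None, :] + hop * np.arange(n_frames)[:, None]
    return signal[idx]                          # (n_frames, win)

def overlap_add(frames, hop):
    """Sum overlapping frames back into a 1-D signal."""
    n_frames, win = frames.shape
    out = np.zeros(hop * (n_frames - 1) + win)
    for i in range(n_frames):
        out[i * hop : i * hop + win] += frames[i]
    return out

def dft_filterbank(win):
    """Hypothetical STFT-like analysis filters: cosine rows stacked on sine rows."""
    n = np.arange(win)[None, :]
    k = np.arange(win // 2 + 1)[:, None]
    angle = 2.0 * np.pi * k * n / win
    return np.vstack([np.cos(angle), np.sin(angle)])   # (win + 2, win) for even win

def encode(signal, filters, hop):
    """Fixed encoder: project every frame onto every filter."""
    return frame(signal, filters.shape[1], hop) @ filters.T

def decode(features, filters, hop):
    """Pseudo-inverse decoder: map features back to frames, then overlap-add."""
    frames = features @ np.linalg.pinv(filters).T
    summed = overlap_add(frames, hop)
    # Divide by how many frames cover each sample to undo the overlap.
    counts = overlap_add(np.ones_like(frames), hop)
    return summed / counts

win, hop = 16, 8                                # assumed values, not from the paper
filters = dft_filterbank(win)
rng = np.random.default_rng(0)
mixture = rng.standard_normal(160)              # stand-in for a speech mixture
features = encode(mixture, filters, hop)
reconstruction = decode(features, filters, hop)
print("max reconstruction error:", np.max(np.abs(reconstruction - mixture)))
```

Because this filterbank has full column rank, the pseudo-inverse recovers each frame exactly when the features are left unmodified; in a separation system the features would first be masked by the network, and a learnable decoder would replace `np.linalg.pinv(filters)` with trained weights.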