Existing speech enhancement methods mainly separate speech from noise at the signal level or in the time-frequency domain, and seldom pay attention to the semantic information of a corrupted signal. In this paper, we aim to bridge this gap by extracting phoneme identities to help speech enhancement. Specifically, we propose a phoneme-based distribution regularization (PbDr) for speech enhancement, which incorporates frame-wise phoneme information into a speech enhancement network in a conditional manner. Because different phonemes lead to different feature distributions across frequency, we propose to learn a parameter pair, i.e., a scale and a bias, from a phoneme classification vector and use it to modulate the speech enhancement network. The modulation parameter pair carries not only frame-wise but also frequency-wise conditions, which effectively map features to phoneme-related distributions. In this way, we explicitly regularize speech enhancement features by recognition vectors. Experiments on public datasets demonstrate that the proposed PbDr module improves not only the perceptual quality of the enhanced speech but also the recognition accuracy of an ASR system on the enhanced speech. The PbDr module could also be readily incorporated into other speech enhancement networks.
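The scale-and-bias modulation described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the layers, so the single affine map from the phoneme classification vector to each parameter, the tensor layout (frames x channels x frequency bins), and all names (`linear`, `modulate`, `w_scale`, `w_bias`) are assumptions made for illustration.

```python
def linear(vec, weights, offset):
    """Affine map: vec (length K) times weights (K x F), plus offset (length F)."""
    return [
        sum(v * row[f] for v, row in zip(vec, weights)) + offset[f]
        for f in range(len(offset))
    ]


def modulate(features, phoneme_vectors, w_scale, b_scale, w_bias, b_bias):
    """Apply a per-frame, per-frequency scale and bias to enhancement features.

    features:        T frames x C channels x F frequency bins (nested lists)
    phoneme_vectors: T frames x K phoneme classes (e.g. classifier posteriors)
    Returns a new T x C x F nested list.
    """
    out = []
    for frame, phoneme_vec in zip(features, phoneme_vectors):
        # Assumed design: one affine map per parameter, shared by all frames.
        scale = linear(phoneme_vec, w_scale, b_scale)  # length F
        bias = linear(phoneme_vec, w_bias, b_bias)     # length F
        # The same (scale, bias) pair is applied to every channel of the frame.
        out.append(
            [[s * x + b for x, s, b in zip(channel, scale, bias)]
             for channel in frame]
        )
    return out


# Toy example: T=2 frames, C=1 channel, F=3 bins, K=2 phoneme classes.
features = [[[1, 1, 1]], [[2, 2, 2]]]
phoneme_vectors = [[1, 0], [0, 1]]  # one-hot for clarity
w_scale = [[1, 2, 3], [0.5, 0.5, 0.5]]
w_bias = [[0, 0, 0], [1, 1, 1]]
zeros = [0, 0, 0]

modulated = modulate(features, phoneme_vectors, w_scale, zeros, w_bias, zeros)
```

In this toy setup each phoneme class selects its own row of scales and biases, so frames labeled with different phonemes are mapped to different frequency-wise distributions. In a trained network the weights would be learned jointly with the enhancement model, and the phoneme vectors would come from a phoneme classifier rather than being one-hot.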