In this paper, we present a novel system that separates the voice of a target speaker from multi-speaker signals by making use of a reference signal from the target speaker. We achieve this by training two separate neural networks: (1) a speaker recognition network that produces speaker-discriminative embeddings; (2) a spectrogram masking network that takes both the noisy spectrogram and the speaker embedding as input and produces a mask. Our system significantly reduces the speech recognition word error rate (WER) on multi-speaker signals, with minimal WER degradation on single-speaker signals.
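The two-network pipeline described above can be sketched as follows. This is a toy NumPy illustration with random, untrained weights, not the paper's implementation: all dimensions, layer shapes, and function names are assumptions chosen for clarity. The key structural idea it shows is that the speaker embedding is concatenated with each noisy spectrogram frame, and the masking network outputs a soft mask in [0, 1] that is multiplied element-wise with the noisy spectrogram.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (illustrative, not from the paper).
N_FRAMES, N_FREQ, EMB_DIM = 50, 257, 256

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def speaker_embedding(reference_features):
    """Stand-in for the speaker recognition network: maps reference-signal
    features to a fixed-dimensional, unit-normalized speaker embedding."""
    W = rng.standard_normal((EMB_DIM, reference_features.shape[-1]))
    emb = np.tanh(W @ reference_features.mean(axis=0))
    return emb / np.linalg.norm(emb)

def masking_network(noisy_spec, emb):
    """Stand-in for the spectrogram masking network: each noisy frame is
    concatenated with the speaker embedding; one linear layer + sigmoid
    yields a soft mask in [0, 1] per time-frequency bin."""
    W = rng.standard_normal((N_FREQ, N_FREQ + EMB_DIM)) * 0.1
    frames = np.concatenate(
        [noisy_spec, np.tile(emb, (noisy_spec.shape[0], 1))], axis=1)
    return sigmoid(frames @ W.T)

reference = rng.standard_normal((20, 40))                # reference-signal features
noisy = np.abs(rng.standard_normal((N_FRAMES, N_FREQ)))  # noisy magnitude spectrogram

emb = speaker_embedding(reference)
mask = masking_network(noisy, emb)
enhanced = mask * noisy  # masked spectrogram attributed to the target speaker

print(enhanced.shape)  # (50, 257)
```

In a trained system the two networks would be learned (the speaker network on a speaker-verification objective, the masking network on a spectrogram reconstruction loss); here the random weights only demonstrate the data flow.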