Self-supervised learning (SSL) has achieved great success in speech recognition, but only limited exploration has been attempted for other speech processing tasks. Because the speech signal contains multi-faceted information, including speaker identity, paralinguistics, and spoken content, learning universal representations for all speech tasks is challenging. To tackle this problem, we propose a new pre-trained model, WavLM, to solve full-stack downstream speech tasks. WavLM jointly learns masked speech prediction and denoising during pre-training. In this way, WavLM not only keeps the speech content modeling capability through masked speech prediction, but also improves its potential on non-ASR tasks through speech denoising. In addition, WavLM employs a gated relative position bias in the Transformer structure to better capture the sequence ordering of the input speech. We also scale up the training dataset from 60k hours to 94k hours. WavLM Large achieves state-of-the-art performance on the SUPERB benchmark and brings significant improvements to various speech processing tasks on their representative benchmarks. The code and pre-trained models are available at https://aka.ms/wavlm.
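The joint masked-prediction-and-denoising setup described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the function name, the gain sampling, and the zero-filled masking are assumptions, and the real model predicts discrete pseudo-labels of the clean speech through a Transformer rather than raw waveform samples.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_training_example(clean, noise, mask_prob=0.065, mask_span=10):
    """Build one (input, mask, target) triple in the spirit of denoising
    masked speech prediction: the model INPUT is the clean utterance mixed
    with noise (or an overlapping utterance), while the prediction TARGETS
    come from the clean speech, so the model must denoise and predict
    masked content at the same time."""
    # Mix in the distractor at a random gain (a real recipe samples an SNR).
    gain = rng.uniform(0.0, 0.5)
    noisy = clean + gain * noise[: len(clean)]

    # Sample mask start positions; each start masks a contiguous span.
    n = len(noisy)
    starts = rng.random(n) < mask_prob
    mask = np.zeros(n, dtype=bool)
    for s in np.flatnonzero(starts):
        mask[s : s + mask_span] = True

    model_input = noisy.copy()
    model_input[mask] = 0.0   # stand-in for a learned mask embedding
    targets = clean[mask]     # targets are derived from the CLEAN speech
    return model_input, mask, targets
```

The key design choice this illustrates is that the corruption (noise/overlap mixing) is applied only to the input, never to the targets, which is what pushes the learned representations toward noise robustness for non-ASR tasks.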