Recently, an audio-visual speech generative model based on a variational autoencoder (VAE) has been proposed; combined with a nonnegative matrix factorization (NMF) model for the noise variance, it performs unsupervised speech enhancement. When the visual data are clean, speech enhancement with the audio-visual VAE outperforms the audio-only VAE, which is trained on audio data alone. However, the audio-visual VAE is not robust to noisy visual data, e.g., video frames in which the speaker's face is not frontal or the lip region is occluded. In this paper, we propose a robust unsupervised audio-visual speech enhancement method based on a per-frame VAE mixture model. This mixture consists of a trained audio-only VAE and a trained audio-visual VAE, the motivation being to skip noisy visual frames by switching to the audio-only VAE. We present a variational expectation-maximization method to estimate the parameters of the model. Experiments demonstrate the promising performance of the proposed method.
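To make the per-frame mixture idea concrete, here is a minimal sketch (not the authors' code) of the E-step of such a mixture: each time frame is explained either by the audio-visual VAE or the audio-only VAE, and a posterior "responsibility" is computed per frame from the frame log-likelihoods. The function name, the per-frame log-likelihoods, and the mixing weight `pi_av` are illustrative placeholders for quantities the paper would estimate within its variational EM procedure.

```python
import numpy as np

def e_step_responsibilities(loglik_av, loglik_a, pi_av):
    """Posterior probability that each frame is explained by the
    audio-visual VAE rather than the audio-only VAE.

    loglik_av, loglik_a : (T,) per-frame log-likelihoods under each VAE
    pi_av               : prior probability of the audio-visual component
    """
    # Work in the log domain for numerical stability.
    log_w_av = np.log(pi_av) + loglik_av
    log_w_a = np.log1p(-pi_av) + loglik_a
    log_norm = np.logaddexp(log_w_av, log_w_a)
    return np.exp(log_w_av - log_norm)

# Toy usage: frames 3-5 have corrupted video (e.g., occluded lips),
# so the audio-visual likelihood drops there and the mixture falls
# back to the audio-only component on those frames.
rng = np.random.default_rng(0)
loglik_av = rng.normal(-10.0, 1.0, size=8)
loglik_a = rng.normal(-11.0, 1.0, size=8)
loglik_av[3:6] -= 20.0  # simulated visual corruption
gamma = e_step_responsibilities(loglik_av, loglik_a, pi_av=0.7)
print(np.round(gamma, 3))  # responsibilities near 0 on corrupted frames
```

Under these assumptions, frames with unreliable video receive a near-zero responsibility for the audio-visual component, which is exactly the "skip noisy visual frames" behavior the method targets.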