Inspired by the recent development of deep network-based methods in semantic image segmentation, we introduce an end-to-end trainable model for face mask extraction in video sequences. Compared to landmark-based sparse face shape representations, our method can produce segmentation masks of individual facial components, which better reflect their detailed shape variations. By integrating the Convolutional LSTM (ConvLSTM) algorithm with Fully Convolutional Networks (FCN), our ConvLSTM-FCN model works on a per-sequence basis and takes advantage of the temporal correlation within video clips. In addition, we propose a novel loss function, called Segmentation Loss, to directly optimise the Intersection over Union (IoU) performance. In practice, to further increase segmentation accuracy, one primary model and two additional models are trained to focus on the face, eyes, and mouth regions, respectively. Our experiments show that the proposed method achieves a 16.99% relative improvement (from 54.50% to 63.76% mean IoU) over the baseline FCN model on the 300 Videos in the Wild (300VW) dataset.
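As a rough illustration of the recurrent building block referenced above, the sketch below shows a generic ConvLSTM cell, in which the LSTM gates are computed with 2-D convolutions so that the spatial layout of the FCN feature maps is preserved across time steps. This is a minimal sketch of the standard ConvLSTM formulation, not the exact architecture of the proposed ConvLSTM-FCN model; the class name, parameter names, and the use of PyTorch are illustrative assumptions.

```python
import torch
import torch.nn as nn


class ConvLSTMCell(nn.Module):
    """Minimal ConvLSTM cell (sketch): LSTM gates computed with 2-D
    convolutions over feature maps instead of fully connected layers."""

    def __init__(self, in_channels, hidden_channels, kernel_size=3):
        super().__init__()
        self.hidden_channels = hidden_channels
        # One convolution produces all four gates (input, forget, output, candidate).
        self.gates = nn.Conv2d(in_channels + hidden_channels,
                               4 * hidden_channels,
                               kernel_size,
                               padding=kernel_size // 2)

    def forward(self, x, state):
        # x: (N, C_in, H, W) feature map for the current frame
        # state: tuple (h, c), each of shape (N, C_hidden, H, W)
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        g = torch.tanh(g)
        c = f * c + i * g          # update cell state
        h = o * torch.tanh(c)      # new hidden state, same spatial size as input
        return h, c
```

In a ConvLSTM-FCN arrangement such a cell would typically be placed on top of the FCN's per-frame feature maps, so that the segmentation head sees features aggregated over preceding frames of the sequence.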
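The abstract does not give the exact form of the proposed Segmentation Loss; a common differentiable surrogate that directly targets IoU is the soft-IoU loss sketched below. The function name and the PyTorch framework are illustrative assumptions, not the paper's implementation.

```python
import torch


def soft_iou_loss(pred, target, eps=1e-6):
    """Differentiable soft-IoU loss (sketch).

    pred:   (N, C, H, W) per-class probabilities (e.g. after softmax)
    target: (N, C, H, W) one-hot ground-truth masks
    """
    dims = (0, 2, 3)  # sum over batch and spatial dimensions, keep classes
    intersection = (pred * target).sum(dim=dims)
    union = (pred + target - pred * target).sum(dim=dims)
    iou = (intersection + eps) / (union + eps)
    # Minimising (1 - mean IoU) pushes the predicted masks toward higher overlap.
    return 1.0 - iou.mean()
```

Because the loss is computed from soft probabilities rather than hard labels, gradients flow through every pixel, which is what makes it possible to optimise an IoU-style criterion directly during training.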