Several training strategies and temporal models have recently been proposed for isolated-word lip-reading in a series of independent works. However, the potential of combining the best strategies, and the impact of each of them, has not been explored. In this paper, we systematically investigate the performance of state-of-the-art data augmentation approaches, temporal models and other training strategies, such as self-distillation and the use of word boundary indicators. Our results show that Time Masking (TM) is the most important augmentation, followed by mixup, and that Densely-Connected Temporal Convolutional Networks (DC-TCN) are the best temporal model for lip-reading of isolated words. Using self-distillation and word boundary indicators is also beneficial, but to a lesser extent. A combination of all the above methods results in a classification accuracy of 93.4%, an absolute improvement of 4.6% over the current state-of-the-art performance on the LRW dataset. The performance can be further improved to 94.1% by pre-training on additional datasets. An error analysis of the various training strategies reveals that performance improves mainly by raising the classification accuracy of hard-to-recognise words.
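As a rough illustration (not the authors' implementation), the sketch below shows how the two augmentations highlighted above, Time Masking and mixup, could be applied to video clips of mouth regions. The tensor shapes, the mask-length bound, and the mixup alpha are illustrative assumptions, not values taken from the paper.

```python
# Minimal sketch of Time Masking (TM) and mixup for video-clip inputs.
# Assumed shapes: a clip is (T, H, W); a batch is (B, T, H, W).
# max_mask_len and alpha are hypothetical hyper-parameters.
import torch

def time_mask(clip: torch.Tensor, max_mask_len: int = 10) -> torch.Tensor:
    """Zero out a random contiguous span of frames in a (T, H, W) clip."""
    t = clip.size(0)
    span = torch.randint(0, max_mask_len + 1, (1,)).item()
    if span == 0 or span >= t:
        return clip
    start = torch.randint(0, t - span + 1, (1,)).item()
    clip = clip.clone()
    clip[start:start + span] = 0.0  # masked frames replaced by zeros
    return clip

def mixup(x: torch.Tensor, y: torch.Tensor, alpha: float = 0.4):
    """Convex-combine a batch with a shuffled copy of itself."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))
    x_mixed = lam * x + (1.0 - lam) * x[perm]
    # Train with the mixed targets:
    # loss = lam * CE(logits, y) + (1 - lam) * CE(logits, y[perm])
    return x_mixed, y, y[perm], lam
```

In this sketch, time masking forces the temporal model to rely on the surrounding frames rather than any single segment of the word, while mixup regularises the classifier by interpolating between examples, which is consistent with the abstract's finding that TM contributes the most, followed by mixup.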