We focus on word-level visual lipreading, which requires decoding the spoken word from a video of the speaker. Many recent state-of-the-art lipreading methods use end-to-end trainable deep models, with a 2D convolutional network (e.g., ResNet) as the front-end visual feature extractor and a sequence model (e.g., Bi-LSTM or Bi-GRU) as the back-end. Although a deep 2D convolutional neural network provides informative image-based features, it ignores the temporal motion between adjacent frames. In this work, we investigate the spatiotemporal modeling capacity of I3D (Inflated 3D ConvNet) for visual lipreading. We demonstrate that, after pre-training on a large-scale video action recognition dataset (e.g., Kinetics), our models show a considerable performance improvement on lipreading. We also report a comparison across a set of video model architectures and input data representations. Our extensive experiments on LRW show that a two-stream I3D model with RGB video and optical flow as inputs achieves state-of-the-art performance.
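
To make the two-stream idea concrete, the following is a minimal sketch of how per-word scores from an RGB stream and an optical-flow stream could be combined at test time. The abstract does not state how the two streams are fused, so the weighted averaging of softmax probabilities used here is an assumption, and the names `fuse_two_stream`, `predict_word`, and `rgb_weight` are hypothetical, as is the three-word vocabulary in the usage note.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_two_stream(rgb_logits, flow_logits, rgb_weight=0.5):
    """Late fusion: weighted average of the two streams' class probabilities.

    Assumption: the streams are fused by averaging probabilities; the
    source abstract does not specify the fusion rule.
    """
    if len(rgb_logits) != len(flow_logits):
        raise ValueError("both streams must score the same set of words")
    if not 0.0 <= rgb_weight <= 1.0:
        raise ValueError("rgb_weight must lie in [0, 1]")
    rgb_probs = softmax(rgb_logits)
    flow_probs = softmax(flow_logits)
    return [
        rgb_weight * r + (1.0 - rgb_weight) * f
        for r, f in zip(rgb_probs, flow_probs)
    ]


def predict_word(rgb_logits, flow_logits, vocabulary, rgb_weight=0.5):
    """Return the vocabulary entry with the highest fused probability."""
    if len(vocabulary) != len(rgb_logits):
        raise ValueError("vocabulary size must match the number of scores")
    probs = fuse_two_stream(rgb_logits, flow_logits, rgb_weight)
    best = max(range(len(probs)), key=probs.__getitem__)
    return vocabulary[best]
```

For instance, calling `predict_word([2.0, 1.0, 0.1], [0.5, 2.5, 0.3], ["ABOUT", "ABSOLUTELY", "ABUSE"])` picks the word on which the two streams jointly place the most probability, even though the RGB stream alone would have chosen a different word.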
