We focus on the task of Automatic Live Video Commenting (ALVC), which aims to generate real-time video comments based on both video frames and other viewers' remarks. An intractable challenge in this task is the appropriate modeling of the complex dependencies between video and textual inputs. Previous work on the ALVC task applies separate attention to these two input sources to obtain their representations. In this paper, we argue that the information of video and text should be modeled jointly. We propose a novel model equipped with a Diversified Co-Attention (DCA) layer and a Gated Attention Module (GAM). DCA enables interactions between video and text from diversified perspectives via metric learning, while GAM collects an informative context for comment generation. We further introduce a parameter orthogonalization technique to alleviate information redundancy in DCA. Experimental results show that our model outperforms both previous approaches to the ALVC task and the traditional co-attention model, achieving state-of-the-art results.
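To make the idea of co-attention under multiple learned metrics concrete, the following is a minimal sketch, not the authors' actual implementation: it assumes video and text features share a dimension, computes a bilinear affinity under each learned metric matrix, attends in both directions, averages the resulting perspectives, and illustrates parameter orthogonalization as a soft penalty that pushes the metric matrices apart. The class name, shapes, and fusion-by-averaging choice are assumptions for illustration; the GAM is not shown.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DiversifiedCoAttention(nn.Module):
    """Illustrative co-attention computed under several learned bilinear
    metrics (the "diversified perspectives"); a sketch, not the paper's
    exact model."""

    def __init__(self, dim: int, n_metrics: int = 4):
        super().__init__()
        # One learnable metric matrix per perspective (hypothetical setup).
        self.metrics = nn.Parameter(torch.randn(n_metrics, dim, dim) * 0.02)

    def forward(self, video: torch.Tensor, text: torch.Tensor):
        # video: (B, Lv, dim), text: (B, Lt, dim)
        v_ctx, t_ctx = [], []
        for M in self.metrics:
            # Affinity between every frame and every token under metric M.
            affinity = video @ M @ text.transpose(1, 2)          # (B, Lv, Lt)
            v_ctx.append(F.softmax(affinity, dim=-1) @ text)      # text-aware video
            t_ctx.append(F.softmax(affinity.transpose(1, 2), dim=-1) @ video)
        # Average the perspectives (a simple fusion choice for this sketch).
        return torch.stack(v_ctx).mean(0), torch.stack(t_ctx).mean(0)

    def orthogonality_penalty(self) -> torch.Tensor:
        # One possible realization of parameter orthogonalization: penalize
        # overlap between metric matrices to reduce redundant perspectives.
        flat = F.normalize(self.metrics.flatten(1), dim=-1)       # (n, dim*dim)
        gram = flat @ flat.t()
        off_diag = gram - torch.eye(gram.size(0), device=gram.device)
        return off_diag.pow(2).sum()
```

In such a setup, the penalty term would simply be added to the generation loss with a small weight, so that the metrics stay distinct while the co-attention outputs feed the downstream comment decoder.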