An ideal description of a given video should focus on its salient and representative content, which distinguishes the video from others. However, the word distribution in video captioning datasets is unbalanced: distinctive words that describe video-specific salient objects are far rarer than common words such as 'a', 'the', and 'person'. This dataset bias often leads to recognition errors or missing details for salient but unusual objects. To address this issue, we propose a novel learning strategy called Information Loss, which focuses on the relationship between video-specific visual content and the corresponding representative words. Moreover, we establish a framework with hierarchical visual representations and an optimized hierarchical attention mechanism to capture the most salient spatio-temporal visual information, fully exploiting the potential of the proposed learning strategy. Extensive experiments demonstrate that this guidance strategy, together with the optimized architecture, outperforms state-of-the-art video captioning methods on MSVD with a CIDEr score of 87.5 and achieves a superior CIDEr score of 47.7 on MSR-VTT. We also show that Information Loss is generic, improving various models by significant margins.
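The abstract does not give the exact formulation of Information Loss, which conditions on video-specific visual content. As a minimal sketch of the underlying idea only, the snippet below re-weights the per-token cross-entropy of a caption decoder so that rare, distinctive words contribute more to the loss than frequent function words. The helper names (`informativeness_weights`, `weighted_caption_loss`) and the inverse-frequency weighting scheme are illustrative assumptions, not the paper's definition.

```python
import torch
import torch.nn.functional as F

def informativeness_weights(token_counts, alpha=1.0):
    """Per-word weights inversely related to corpus frequency (assumed scheme).

    Rare, distinctive words receive larger weights than common
    function words such as 'a', 'the', 'person'.
    """
    counts = torch.tensor(token_counts, dtype=torch.float)
    freq = counts / counts.sum()
    w = (1.0 / (freq + 1e-8)) ** alpha
    return w / w.mean()  # normalize to mean 1 for a stable loss scale

def weighted_caption_loss(logits, targets, word_weights, pad_id=0):
    """Cross-entropy over caption tokens, re-weighted per target word.

    logits:  (batch, seq_len, vocab) decoder outputs
    targets: (batch, seq_len) ground-truth token ids
    """
    # Per-token loss: cross_entropy expects the class dim second.
    per_token = F.cross_entropy(
        logits.transpose(1, 2), targets, reduction="none")  # (batch, seq_len)
    weights = word_weights[targets]          # look up one weight per target token
    mask = (targets != pad_id).float()       # ignore padding positions
    return (per_token * weights * mask).sum() / mask.sum().clamp(min=1.0)

# Toy usage with hypothetical counts: <pad>, 'a', 'the', 'dog', 'accordion'
# (the pad weight is irrelevant since pad tokens are masked out).
vocab_counts = [9000, 9000, 8000, 300, 5]
w = informativeness_weights(vocab_counts)
logits = torch.randn(2, 7, 5)                # (batch=2, seq_len=7, vocab=5)
targets = torch.randint(0, 5, (2, 7))
loss = weighted_caption_loss(logits, targets, w)
```

In this sketch the weighting is purely frequency-based; the paper's strategy is described as video-specific, tying a word's importance to the salient content of the particular video rather than to corpus statistics alone.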