Speaker naming, the task of locating and identifying the active speaker in a movie or drama scene, is crucial for high-level video analysis applications such as automatic subtitle labeling and video summarization. Modern approaches typically exploit biometric features with gradient-based methods rather than rule-based algorithms. In some situations, however, a naive gradient-based method is inefficient. For example, when new characters are added to the target identification list, the neural network must be retrained frequently to identify them, which delays model preparation. In this paper, we present an attention-based method that reduces model setup time by incorporating newly added data through online adaptation, without a gradient update process. We compare the attention-based method with existing gradient-based methods on three evaluation metrics (accuracy, memory usage, and setup time) under various controlled speaker naming settings. We also apply existing speaker naming models and the attention-based model to real video, showing that our approach achieves accuracy comparable to existing state-of-the-art models and, in some cases, higher accuracy.
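To make the idea of gradient-free online adaptation concrete, the following is a minimal sketch and not the model described in this paper. It assumes that each face or voice sample has already been mapped to a fixed-length embedding by some pretrained feature extractor, and that identification is done by softmax attention over a memory of stored embeddings. The class name `AttentionSpeakerMemory`, its methods, and the `temperature` parameter are hypothetical names introduced only for illustration.

```python
import math
from collections import defaultdict


class AttentionSpeakerMemory:
    """Hypothetical key-value memory: keys are embeddings, values are speaker names."""

    def __init__(self, temperature=0.1):
        self.keys = []      # unit-normalised embeddings
        self.labels = []    # speaker name stored alongside each key
        self.temperature = temperature  # assumed softmax sharpness, not from the paper

    @staticmethod
    def _normalise(vec):
        norm = math.sqrt(sum(x * x for x in vec))
        if norm == 0:
            raise ValueError("embedding must be non-zero")
        return [x / norm for x in vec]

    def add(self, embedding, name):
        """Enrol a sample by appending it to the memory; no gradient step is taken."""
        self.keys.append(self._normalise(embedding))
        self.labels.append(name)

    def predict(self, query):
        """Return the best-matching name and the attention mass per speaker."""
        if not self.keys:
            raise ValueError("memory is empty")
        q = self._normalise(query)
        # Scaled cosine similarity between the query and every stored key.
        scores = [sum(a * b for a, b in zip(q, k)) / self.temperature
                  for k in self.keys]
        # Numerically stable softmax over the memory slots.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        mass = defaultdict(float)
        for weight, label in zip(exps, self.labels):
            mass[label] += weight / total
        best = max(mass, key=mass.get)
        return best, dict(mass)


memory = AttentionSpeakerMemory()
memory.add([1.0, 0.0, 0.0], "alice")
memory.add([0.0, 1.0, 0.0], "bob")
print(memory.predict([0.9, 0.1, 0.0])[0])

# A new character is enrolled with a single append; nothing is retrained.
memory.add([0.0, 0.0, 1.0], "carol")
print(memory.predict([0.1, 0.0, 0.95])[0])
```

In this sketch the cost of adding a character is one list append, while memory usage grows with the number of stored samples. That mirrors, in a simplified way, the trade-off among accuracy, memory usage, and setup time that the abstract lists as evaluation metrics.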