Current movie captioning architectures cannot refer to characters by their proper names and instead replace them with a generic "someone" tag. The lack of movie description datasets with visual annotations of characters is a likely contributor to this limitation. Recently, we proposed extending the M-VAD dataset with such annotations. In this paper, we present an improved version of the dataset, M-VAD Names, together with its semi-automatic annotation procedure. The resulting dataset contains 63k visual tracks and 34k textual mentions, all associated with character identities. To showcase the features of the dataset and quantify the complexity of the naming task, we investigate multimodal architectures that replace the "someone" tags with proper character names in existing video captions. We further extend the evaluation by testing this application on videos outside the M-VAD Names dataset.