Fighting FIRe with FIRE: Assessing the Validity of Text-to-Video Retrieval Benchmarks - Zhuanzhi Papers

Video Retrieval · Benchmarks · Validity · Video · False Negatives

April 19, 2023

Fighting FIRe with FIRE: Assessing the Validity of Text-to-Video Retrieval Benchmarks

Pedro Rodriguez, Mahmoud Azab, Becka Silvert, Renato Sanchez, Linzy Labson, Hardik Shah, Seungwhan Moon

From arXiv; EACL 2023 camera-ready

Searching troves of videos with textual descriptions is a core multimodal retrieval task. Owing to the lack of a purpose-built dataset for text-to-video retrieval, video captioning datasets have been re-purposed to evaluate models by (1) treating captions as positive matches to their respective videos and (2) assuming all other videos to be negatives. However, this methodology leads to a fundamental flaw during evaluation: captions are marked as relevant only to their original video even though many alternate videos also match the caption, which introduces false-negative caption-video pairs. We show that when these false negatives are corrected, a recent state-of-the-art model gains 25% recall points -- a difference that threatens the validity of the benchmark itself. To diagnose and mitigate this issue, we annotate and release 683K additional caption-video pairs. Using these, we recompute effectiveness scores for three models on two standard benchmarks (MSR-VTT and MSVD). We find that (1) the recomputed metrics are up to 25% recall points higher for the best models, (2) these benchmarks are nearing saturation for Recall@10, (3) caption length (generality) is related to the number of positives, and (4) annotation costs can be mitigated through sampling. We recommend retiring these benchmarks in their current form, and we make recommendations for future text-to-video retrieval benchmarks.
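
The effect described in the abstract can be illustrated with a minimal sketch, which is not the authors' evaluation code. It assumes that a caption counts as a hit at Recall@k when at least one video judged relevant appears in the top k results. The function name `recall_at_k`, the captions, the video IDs, and the rankings are all hypothetical toy values invented for this example. The same model rankings are scored twice: once with the original labels (one positive per caption) and once with labels in which a false negative has been corrected.

```python
from typing import Dict, List, Set


def recall_at_k(
    rankings: Dict[str, List[str]],
    positives: Dict[str, Set[str]],
    k: int,
) -> float:
    """Fraction of captions with at least one relevant video in the top k.

    Assumption: "any relevant video in the top k" counts as a hit.
    """
    hits = 0
    for caption, ranked_videos in rankings.items():
        if positives[caption] & set(ranked_videos[:k]):
            hits += 1
    return hits / len(rankings)


# Hypothetical model output: videos ranked per caption, best match first.
rankings = {
    "a man is cooking": ["v2", "v1", "v3", "v4"],
    "a dog runs on grass": ["v3", "v4", "v1", "v2"],
    "someone plays guitar": ["v4", "v1", "v2", "v3"],
}

# Original benchmark labels: each caption is relevant only to its source video.
original_labels = {
    "a man is cooking": {"v1"},
    "a dog runs on grass": {"v3"},
    "someone plays guitar": {"v2"},
}

# Corrected labels: v2 also shows a man cooking, so it was a false negative.
corrected_labels = {
    "a man is cooking": {"v1", "v2"},
    "a dog runs on grass": {"v3"},
    "someone plays guitar": {"v2"},
}

original_r1 = recall_at_k(rankings, original_labels, k=1)
corrected_r1 = recall_at_k(rankings, corrected_labels, k=1)
print(f"Recall@1 with original labels:  {original_r1:.3f}")
print(f"Recall@1 with corrected labels: {corrected_r1:.3f}")
```

In this toy setup the rankings are identical in both runs; only the relevance labels change, which is why a score gap of this kind reflects on the benchmark rather than on the model.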
