走向以视频为依据的贬低时间徒刑 (Towards Debiasing Temporal Sentence Grounding in Video) - 专知论文

会员服务 ·

0

矩 · MoDELS · 有偏 · 过采样 · Performer ·

2021 年 11 月 8 日

Towards Debiasing Temporal Sentence Grounding in Video

翻译：走向以视频为依据的贬低时间徒刑

Hao Zhang,Aixin Sun,Wei Jing,Joey Tianyi Zhou

from arxiv, 13 pages, 6 figures, 11 tables

The temporal sentence grounding in video (TSGV) task is to locate a temporal moment from an untrimmed video, to match a language query, i.e., a sentence. Without considering bias in moment annotations (e.g., start and end positions in a video), many models tend to capture statistical regularities of the moment annotations, and do not well learn cross-modal reasoning between video and language query. In this paper, we propose two debiasing strategies, data debiasing and model debiasing, to "force" a TSGV model to capture cross-modal interactions. Data debiasing performs data oversampling through video truncation to balance moment temporal distribution in train set. Model debiasing leverages video-only and query-only models to capture the distribution bias, and forces the model to learn cross-modal interactions. Using VSLNet as the base model, we evaluate impact of the two strategies on two datasets that contain out-of-distribution test instances. Results show that both strategies are effective in improving model generalization capability. Equipped with both debiasing strategies, VSLNet achieves best results on both datasets.

翻译：在视频( TSGV) 任务中, 将时间句定位在未剪辑的视频( TSGV) 上, 以匹配语言查询, 即句子。许多模型不考虑瞬间说明中的偏差( 例如, 在视频中的起始位置和结尾位置), 倾向于捕捉瞬间说明的统计规律性, 并且没有很好地学习视频和语言查询之间的跨模式推理。在本文中, 我们提议了两种贬低策略, 数据偏差和模型偏差, 以“ 强制” TSGV 模型来捕捉跨模式的互动。数据偏差表现了通过视频截断到平衡瞬间时间分布的数据。模型偏差利用仅视频和仅查询的模型来捕捉分布偏差, 并迫使模型学习跨模式互动。我们用 VSLNet 作为基础模型, 评估两个战略对包含分配外测试实例的两套数据集的影响。结果显示, 这两种战略在改进模型概括能力方面都是最有效的。

0

相关内容

【EMNLP2020】自然语言生成，Neural Language Generation

【EMNLP2020】自然语言生成，Neural Language Generation

专知会员服务

39+阅读 · 2020年11月20日

【清华大学-腾讯】关系提取综述，Review and Outlook for Relation Extraction

【清华大学-腾讯】关系提取综述，Review and Outlook for Relation Extraction

专知会员服务

38+阅读 · 2020年4月8日

【DeepMind-牛津-CMU-CVPR2020】无监督词映射视觉基准，Visual Grounding in Video

【DeepMind-牛津-CMU-CVPR2020】无监督词映射视觉基准，Visual Grounding in Video

专知会员服务

12+阅读 · 2020年3月13日

【CVPR2020】从未标记的视频中学习视频对象分割，Learning Video Object Segmentation from Unlabeled Videos

【CVPR2020】从未标记的视频中学习视频对象分割，Learning Video Object Segmentation from Unlabeled Videos

专知会员服务

36+阅读 · 2020年3月12日

【DeepMind-牛津-CMU-CVPR2020】无监督文字翻译视频中的视觉基础，Visual Grounding in Video for Unsupervised Word Translation

【DeepMind-牛津-CMU-CVPR2020】无监督文字翻译视频中的视觉基础，Visual Grounding in Video for Unsupervised Word Translation

专知会员服务

13+阅读 · 2020年3月12日

图像分类技巧集，17页ppt《Bag of Tricks for Image Classification》

图像分类技巧集，17页ppt《Bag of Tricks for Image Classification》

专知会员服务

95+阅读 · 2020年3月12日

【WSDM2020】小数据学习，124页ppt，Learning with Small Data，宾夕法尼亚州立大学

【WSDM2020】小数据学习，124页ppt，Learning with Small Data，宾夕法尼亚州立大学

专知会员服务

137+阅读 · 2020年2月6日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

49+阅读 · 2019年10月17日

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

专知会员服务

59+阅读 · 2019年10月17日

【人工智能在2019：一年回顾】反人工智能，AI in 2019: A Year in Review

【人工智能在2019：一年回顾】反人工智能，AI in 2019: A Year in Review

专知会员服务

79+阅读 · 2019年10月10日

已删除

将门创投

8+阅读 · 2017年7月21日

Attributes-Guided and Pure-Visual Attention Alignment for Few-Shot Recognition

Arxiv

8+阅读 · 2020年12月4日

Spatio-Temporal Graph for Video Captioning with Knowledge Distillation

Spatio-Temporal Graph for Video Captioning with Knowledge Distillation

Arxiv

19+阅读 · 2020年3月31日

Object-Oriented Video Captioning with Temporal Graph and Prior Knowledge Building

Object-Oriented Video Captioning with Temporal Graph and Prior Knowledge Building

Arxiv

3+阅读 · 2020年3月12日

Contrastive Bidirectional Transformer for Temporal Representation Learning

Contrastive Bidirectional Transformer for Temporal Representation Learning

Arxiv

3+阅读 · 2019年6月13日

Object-aware Aggregation with Bidirectional Temporal Graph for Video Captioning

Arxiv

3+阅读 · 2019年6月11日

Multimodal Semantic Attention Network for Video Captioning

Arxiv

4+阅读 · 2019年5月8日

Video-to-Video Synthesis

Video-to-Video Synthesis

Arxiv

9+阅读 · 2018年8月20日

Text-to-Clip Video Retrieval with Early Fusion and Re-Captioning

Arxiv

4+阅读 · 2018年4月13日

Integrating both Visual and Audio Cues for Enhanced Video Caption

Arxiv

4+阅读 · 2017年12月9日

Natural Language Guided Visual Relationship Detection

Arxiv

3+阅读 · 2017年11月21日

VIP会员

文章信息

相关主题

相关VIP内容

【EMNLP2020】自然语言生成，Neural Language Generation

【EMNLP2020】自然语言生成，Neural Language Generation

专知会员服务

39+阅读 · 2020年11月20日

【清华大学-腾讯】关系提取综述，Review and Outlook for Relation Extraction

【清华大学-腾讯】关系提取综述，Review and Outlook for Relation Extraction

专知会员服务

38+阅读 · 2020年4月8日

【DeepMind-牛津-CMU-CVPR2020】无监督词映射视觉基准，Visual Grounding in Video

【DeepMind-牛津-CMU-CVPR2020】无监督词映射视觉基准，Visual Grounding in Video

专知会员服务

12+阅读 · 2020年3月13日

【CVPR2020】从未标记的视频中学习视频对象分割，Learning Video Object Segmentation from Unlabeled Videos

【CVPR2020】从未标记的视频中学习视频对象分割，Learning Video Object Segmentation from Unlabeled Videos

专知会员服务

36+阅读 · 2020年3月12日

【DeepMind-牛津-CMU-CVPR2020】无监督文字翻译视频中的视觉基础，Visual Grounding in Video for Unsupervised Word Translation

【DeepMind-牛津-CMU-CVPR2020】无监督文字翻译视频中的视觉基础，Visual Grounding in Video for Unsupervised Word Translation

专知会员服务

13+阅读 · 2020年3月12日

图像分类技巧集，17页ppt《Bag of Tricks for Image Classification》

图像分类技巧集，17页ppt《Bag of Tricks for Image Classification》

专知会员服务

95+阅读 · 2020年3月12日

【WSDM2020】小数据学习，124页ppt，Learning with Small Data，宾夕法尼亚州立大学

【WSDM2020】小数据学习，124页ppt，Learning with Small Data，宾夕法尼亚州立大学

专知会员服务

137+阅读 · 2020年2月6日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

49+阅读 · 2019年10月17日

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

专知会员服务

59+阅读 · 2019年10月17日

【人工智能在2019：一年回顾】反人工智能，AI in 2019: A Year in Review

【人工智能在2019：一年回顾】反人工智能，AI in 2019: A Year in Review

专知会员服务

79+阅读 · 2019年10月10日

热门VIP内容

开通专知VIP会员享更多权益服务

【CMU博士论文】迈向具有高维结果的可靠且稳健的因果推断

《美海军分布式海上作战（DMO）概念：最新情况》

Gemini 2.5：推动前沿，具备先进推理、多模态、长上下文及下一代智能体能力

【ICML2025教程】联想记忆的现代方法

相关资讯

已删除

将门创投

8+阅读 · 2017年7月21日

相关论文

Attributes-Guided and Pure-Visual Attention Alignment for Few-Shot Recognition

Arxiv

8+阅读 · 2020年12月4日

Spatio-Temporal Graph for Video Captioning with Knowledge Distillation

Spatio-Temporal Graph for Video Captioning with Knowledge Distillation

Arxiv

19+阅读 · 2020年3月31日

Object-Oriented Video Captioning with Temporal Graph and Prior Knowledge Building

Object-Oriented Video Captioning with Temporal Graph and Prior Knowledge Building

Arxiv

3+阅读 · 2020年3月12日

Contrastive Bidirectional Transformer for Temporal Representation Learning

Contrastive Bidirectional Transformer for Temporal Representation Learning

Arxiv

3+阅读 · 2019年6月13日

Object-aware Aggregation with Bidirectional Temporal Graph for Video Captioning

Arxiv

3+阅读 · 2019年6月11日

Multimodal Semantic Attention Network for Video Captioning

Arxiv

4+阅读 · 2019年5月8日

Video-to-Video Synthesis

Video-to-Video Synthesis

Arxiv

9+阅读 · 2018年8月20日

Text-to-Clip Video Retrieval with Early Fusion and Re-Captioning

Arxiv

4+阅读 · 2018年4月13日

Integrating both Visual and Audio Cues for Enhanced Video Caption

Arxiv

4+阅读 · 2017年12月9日

Natural Language Guided Visual Relationship Detection

Arxiv

3+阅读 · 2017年11月21日

微信扫码咨询专知VIP会员