Current methods for video activity localisation over time implicitly assume that the activity temporal boundaries labelled for model training are determinate and precise. However, in unscripted natural videos, different activities mostly transition smoothly, so it is intrinsically ambiguous to label precisely when an activity starts and ends. Such temporal labelling uncertainties are currently ignored in model training, resulting in learning mismatched video-text correlations that generalise poorly at test time. In this work, we solve this problem by introducing Elastic Moment Bounding (EMB) to accommodate flexible and adaptive activity temporal boundaries, towards modelling universally interpretable video-text correlations with tolerance to the underlying temporal uncertainties in pre-fixed annotations. Specifically, we construct elastic boundaries adaptively by mining and discovering frame-wise temporal endpoints that maximise the alignment between video segments and query sentences. To enable both more robust matching (segment content attention) and more accurate localisation (segment elastic boundaries), we optimise the selection of frame-wise endpoints subject to segment-wise contents by a novel Guided Attention mechanism. Extensive experiments on three video activity localisation benchmarks demonstrate compellingly EMB's advantages over existing methods that do not model label uncertainty.
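As a rough illustration only (not the paper's implementation), the sketch below shows one way frame-wise elastic boundary mining could look in PyTorch: within a small slack window around the annotated endpoints, it enumerates candidate (start, end) pairs and keeps the one whose segment best aligns with the query embedding. The function name `elastic_endpoints`, the `slack` parameter, and the brute-force window search are all assumptions for illustration; in EMB the endpoint selection is driven by the learned Guided Attention over segment contents rather than raw similarity scores.

```python
import torch

def elastic_endpoints(frame_feats, query_feat, t_start, t_end, slack=3):
    """Hypothetical sketch of elastic boundary mining (not the EMB code):
    within a slack window around the annotated endpoints, pick the
    frame-wise (start, end) pair maximising mean frame-query alignment."""
    scores = frame_feats @ query_feat                   # (T,) frame-query similarities
    T = frame_feats.size(0)
    best, best_pair = float("-inf"), (t_start, t_end)
    # search endpoints elastically around the (possibly imprecise) annotation
    for s in range(max(0, t_start - slack), min(t_start + slack, t_end) + 1):
        for e in range(max(s, t_end - slack), min(T - 1, t_end + slack) + 1):
            score = scores[s:e + 1].mean().item()       # segment-level alignment
            if score > best:
                best, best_pair = score, (s, e)
    return best_pair

# toy usage: 20 frames, 8-d features, annotated segment [5, 12]
frames = torch.nn.functional.normalize(torch.randn(20, 8), dim=-1)
query = torch.nn.functional.normalize(torch.randn(8), dim=-1)
print(elastic_endpoints(frames, query, t_start=5, t_end=12))
```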