Cross-Modal Object Tracking: Modality-Aware Representations and A Unified Benchmark

In many visual systems, visual tracking is often based on RGB image sequences, in which some targets become unreliable under low-light conditions, so tracking performance degrades significantly. Introducing other modalities such as depth and infrared data is an effective way to handle the imaging limitations of individual sources, but multi-modal imaging platforms usually require elaborate designs and cannot yet be applied in many real-world applications. Near-infrared (NIR) imaging has become an essential part of many surveillance cameras, whose imaging switches between RGB and NIR based on the light intensity. These two modalities are heterogeneous, with very different visual properties, and thus pose major challenges for visual tracking. However, existing work has not studied this challenging problem. In this work, we address the cross-modal object tracking problem and contribute a new video dataset comprising 654 cross-modal image sequences with over 481K frames in total and an average video length of more than 735 frames. To promote research and development on cross-modal object tracking, we propose a new algorithm that learns a modality-aware target representation to mitigate the appearance gap between the RGB and NIR modalities during tracking. It is plug-and-play and can thus be flexibly embedded into different tracking frameworks. We conduct extensive experiments on the dataset and demonstrate the effectiveness of the proposed algorithm in two representative tracking frameworks against 17 state-of-the-art tracking methods. We will release the dataset for free academic use; the dataset download link and code will be released soon.
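
As a quick consistency check on the dataset statistics quoted in the abstract, the short sketch below divides the stated frame total by the stated number of sequences. The constant and function names are illustrative assumptions, and 481,000 is only the lower bound implied by "over 481K", so the true average may be slightly higher.

```python
# Figures quoted in the abstract. The frame total is a lower bound
# ("over 481K"), so the computed average is a lower bound as well.
NUM_SEQUENCES = 654
MIN_TOTAL_FRAMES = 481_000


def average_length(total_frames: int, num_sequences: int) -> float:
    """Return the mean number of frames per sequence."""
    if num_sequences <= 0:
        raise ValueError("num_sequences must be positive")
    return total_frames / num_sequences


avg = average_length(MIN_TOTAL_FRAMES, NUM_SEQUENCES)
print(f"average length >= {avg:.1f} frames")
```

The result is consistent with the stated average of more than 735 frames per video.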
