As a milestone in video object segmentation, one-shot video object segmentation (OSVOS) has surpassed conventional optical-flow based methods by a large margin in segmentation accuracy. Its excellent performance mainly benefits from a three-step training mechanism: (1) acquiring object features on the base dataset (i.e., ImageNet); (2) training the parent network on the training set of the target dataset (i.e., DAVIS-2016) so that it can differentiate the object of interest from the background; (3) online fine-tuning on the first frame of the target test set to overfit the appearance of the object of interest, after which the model can be used to segment the same object in the remaining frames of that video. In this paper, we argue that in step (2), OSVOS tends to 'overemphasize' generic semantic object information while 'diluting' the instance cues of the object(s), which largely hinders the whole training process. By adding a common module, the video loss, which we formulate with various forms of constraints (including a weighted BCE loss, a high-dimensional triplet loss, and a novel mixed instance-aware video loss), to the training of the parent network in step (2), the network is better prepared for step (3), i.e., online fine-tuning on the target instance. Through extensive experiments with different network structures as the backbone, we show that the proposed video loss module significantly improves segmentation performance compared to OSVOS. Moreover, since the video loss is a common module, it can be generalized to other fine-tuning based methods and to similar vision tasks such as depth estimation and saliency detection.
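To make the constraint forms concrete, the following is a minimal NumPy sketch of the two basic terms mentioned above, a class-balanced weighted BCE loss and a triplet loss on pixel embeddings, combined into a mixed loss. The function names, the inverse-frequency weighting scheme, and the mixing weight `alpha` are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def weighted_bce(pred, target, eps=1e-7):
    # Class-balanced BCE: foreground/background terms are weighted by the
    # opposite class frequency to counter the fg/bg pixel imbalance
    # (weighting scheme is an illustrative assumption).
    pred = np.clip(pred, eps, 1.0 - eps)
    pos_frac = target.mean()
    w_pos, w_neg = 1.0 - pos_frac, pos_frac
    loss = -(w_pos * target * np.log(pred)
             + w_neg * (1.0 - target) * np.log(1.0 - pred))
    return loss.mean()

def triplet_loss(anchor, positive, negative, margin=1.0):
    # Hinge on embedding distances: pull same-instance pixel embeddings
    # together, push different-instance embeddings apart by the margin.
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

def mixed_video_loss(pred, target, anchor, positive, negative, alpha=0.5):
    # Hypothetical mixed instance-aware loss: weighted sum of a pixel-wise
    # segmentation term and an instance-discriminative embedding term.
    return weighted_bce(pred, target) + alpha * triplet_loss(anchor, positive, negative)
```

In this reading, the BCE term supervises the generic object/background mask while the triplet term injects the instance cues that step (2) would otherwise dilute; `alpha` trades off the two objectives.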