Mix-and-Match Tuning for Self-Supervised Semantic Segmentation

Deep convolutional networks for semantic image segmentation typically require large-scale labeled data, e.g. ImageNet and MS COCO, for network pre-training. To reduce annotation effort, self-supervised semantic segmentation has recently been proposed to pre-train a network without any human-provided labels. The key to this new form of learning is to design a proxy task (e.g. image colorization) from which a discriminative loss can be formulated on unlabeled data. Many proxy tasks, however, lack the critical supervision signals that could induce a discriminative representation for the target image segmentation task. As a result, the performance of self-supervision is still far from that of supervised pre-training. In this study, we overcome this limitation by incorporating a "mix-and-match" (M&M) tuning stage into the self-supervision pipeline. The proposed approach is readily pluggable into many self-supervision methods and does not use more annotated samples than the original process. Yet it is capable of boosting the performance of the target image segmentation task to surpass that of its fully-supervised pre-trained counterpart. The improvement is made possible by better harnessing the limited pixel-wise annotations in the target dataset. Specifically, we first introduce the "mix" stage, which sparsely samples and mixes patches from the target set to reflect the rich and diverse local patch statistics of target images. A "match" stage then forms a class-wise connected graph, which can be used to derive a strong triplet-based discriminative loss for fine-tuning the network. Our paradigm follows the standard practice in existing self-supervised studies, and no extra data or labels are required. With the proposed M&M approach, for the first time, a self-supervision method can achieve comparable or even better performance than its ImageNet pre-trained counterpart on both the PASCAL VOC2012 dataset and the CityScapes dataset.
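The "mix" and "match" stages described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`sample_patches`, `build_triplets`, `triplet_loss`), the centre-pixel labelling rule, the random choice of positives and negatives, the squared Euclidean distance, and the margin value are all assumptions made for illustration, and the abstract does not specify how the class-wise connected graph is actually built.

```python
import random
from collections import defaultdict


def sample_patches(label_maps, patch_size, per_image, rng):
    """Mix step (sketch): sparsely sample patch centres from each labelled
    image and tag each patch with the class at its centre pixel (assumed rule).
    Returns a list of (image_index, row, col, class_id) tuples."""
    half = patch_size // 2
    patches = []
    for idx, labels in enumerate(label_maps):
        height, width = len(labels), len(labels[0])
        for _ in range(per_image):
            # Keep the whole patch inside the image bounds.
            r = rng.randrange(half, height - half)
            c = rng.randrange(half, width - half)
            patches.append((idx, r, c, labels[r][c]))
    return patches


def build_triplets(patches, rng):
    """Match step (sketch): group patches by class, then pair every anchor
    with one same-class positive and one different-class negative.
    Random selection here stands in for the paper's graph construction."""
    by_class = defaultdict(list)
    for i, patch in enumerate(patches):
        by_class[patch[3]].append(i)

    triplets = []
    for a, patch in enumerate(patches):
        cls = patch[3]
        positives = [i for i in by_class[cls] if i != a]
        negatives = [i for other, members in by_class.items()
                     if other != cls for i in members]
        if positives and negatives:
            triplets.append((a, rng.choice(positives), rng.choice(negatives)))
    return triplets


def triplet_loss(embeddings, triplets, margin=1.0):
    """Mean hinge triplet loss: max(0, d(a, p) - d(a, n) + margin),
    with d the squared Euclidean distance (an assumed choice)."""
    def sqdist(u, v):
        return sum((x - y) ** 2 for x, y in zip(u, v))

    if not triplets:
        return 0.0
    total = 0.0
    for a, p, n in triplets:
        total += max(0.0, sqdist(embeddings[a], embeddings[p])
                     - sqdist(embeddings[a], embeddings[n]) + margin)
    return total / len(triplets)
```

In a real pipeline the embeddings would come from the pre-trained network applied to each sampled patch, and the loss would be minimized by gradient descent during the tuning stage; here they are plain lists of floats so that the sketch runs without any deep learning framework.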
