We aim to improve hand keypoint regression and pixel-level hand segmentation under new imaging conditions (e.g., outdoors) when labeled images are available only for very different conditions (e.g., indoors). In real-world use, a model trained for both tasks must work under a variety of imaging conditions, but existing labeled hand datasets cover only a limited range of them. It is therefore necessary to adapt a model trained on labeled images (source) to unlabeled images (target) captured under unseen imaging conditions. Self-training domain adaptation methods, which learn from the unlabeled target images in a self-supervised manner, have been developed for both tasks, but their training can degrade performance when the predictions on the target images are noisy. To avoid this, it is crucial to assign a low importance (confidence) weight to noisy predictions during self-training. In this paper, we propose to estimate the confidence of each target image for both tasks from the divergence of two predictions. These predictions come from two separate networks, and their divergence helps identify noisy predictions. To integrate this confidence estimation into self-training, we propose a teacher-student framework in which the two networks (teachers) supervise a student network during self-training, and the teachers are in turn learned from the student by knowledge distillation. Our experiments show that the method outperforms state-of-the-art methods in adaptation settings that differ in lighting, grasped objects, backgrounds, and camera viewpoints. On HO3D, it improves the multi-task score by 4% over the latest adversarial adaptation method. We also validate our method on Ego4D, a dataset of egocentric videos with rapid changes in imaging conditions outdoors.
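As a rough illustration of the confidence idea, the following minimal sketch weights a self-training loss by the agreement of two teacher predictions. The abstract does not specify the divergence measure, the mapping from divergence to weight, or how pseudo-labels are formed, so the mean squared difference, the `exp(-scale * divergence)` weight, the averaged pseudo-label, and the names `divergence_confidence`, `weighted_self_training_loss`, and `scale` are all assumptions made for illustration, not the method's actual formulation.

```python
import numpy as np


def divergence_confidence(pred_a, pred_b, scale=1.0):
    """Map the disagreement of two teacher predictions to a weight in (0, 1].

    Assumption: divergence is the per-sample mean squared difference and the
    weight is exp(-scale * divergence); the source does not specify either.
    """
    pred_a = np.asarray(pred_a, dtype=float)
    pred_b = np.asarray(pred_b, dtype=float)
    # Reduce over every axis except the batch axis (axis 0).
    axes = tuple(range(1, pred_a.ndim))
    divergence = np.mean((pred_a - pred_b) ** 2, axis=axes)
    return np.exp(-scale * divergence)


def weighted_self_training_loss(student, pred_a, pred_b, scale=1.0):
    """Confidence-weighted squared error between student and pseudo-labels.

    Assumption: the pseudo-label is the average of the two teacher outputs.
    """
    student = np.asarray(student, dtype=float)
    pred_a = np.asarray(pred_a, dtype=float)
    pred_b = np.asarray(pred_b, dtype=float)
    pseudo_label = 0.5 * (pred_a + pred_b)
    conf = divergence_confidence(pred_a, pred_b, scale)
    axes = tuple(range(1, student.ndim))
    per_sample = np.mean((student - pseudo_label) ** 2, axis=axes)
    # Normalize by the total confidence so noisy samples contribute less.
    return float(np.sum(conf * per_sample) / np.sum(conf))


if __name__ == "__main__":
    # Two target images, 21 hand keypoints with (x, y) coordinates each.
    teacher_a = np.zeros((2, 21, 2))
    teacher_b = teacher_a.copy()
    teacher_b[1] += 1.0  # the teachers disagree on the second image
    print(divergence_confidence(teacher_a, teacher_b))
```

In this toy batch the teachers agree on the first image and disagree on the second, so the second image receives the lower weight. The distillation step that updates the teachers from the student is not shown.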