Monocular depth estimation (MDE), which infers pixel-level depth from a single RGB image captured by a monocular camera, plays a crucial role in a variety of AI applications that require an understanding of three-dimensional (3D) scene structure. In real-world scenarios, MDE models often need to be deployed in environments whose conditions differ from those seen during training. Test-time (domain) adaptation (TTA) is a practical approach to this problem. Although TTA for MDE has advanced notably, particularly in self-supervised settings, existing methods remain ineffective when applied to diverse and dynamic environments. To address this challenge, we propose PITTA, a novel TTA framework for MDE. Our approach incorporates two key strategies: (i) a pose-agnostic TTA paradigm for MDE and (ii) instance-aware image masking. Specifically, PITTA adapts a pretrained MDE network at test time without relying on any camera pose information. In addition, our instance-aware masking strategy extracts instance-wise masks for dynamic objects (e.g., vehicles and pedestrians) from the segmentation mask produced by a pretrained panoptic segmentation network by removing static objects, including background components. To further boost performance, we also present a simple yet effective edge extraction method applied to the input image (i.e., a single monocular image) and the depth map. Extensive experiments on the DrivingStereo and Waymo datasets under varying environmental conditions demonstrate that PITTA outperforms existing state-of-the-art techniques for MDE during TTA, with notable performance improvements.
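As a rough illustration of the instance-aware masking idea described above, the following sketch extracts one boolean mask per dynamic-object instance from a panoptic segmentation output and discards all static and background pixels. It is not the PITTA implementation: the input layout (a per-pixel class map plus a per-pixel instance map), the function name `extract_dynamic_instance_masks`, and the class ids in `DYNAMIC_CLASS_IDS` are all assumptions made for this example, and the real label ids depend on the segmentation network used.

```python
import numpy as np

# Hypothetical class ids for dynamic categories (e.g. pedestrian, vehicle).
# The actual ids depend on the label map of the panoptic segmentation network.
DYNAMIC_CLASS_IDS = (11, 13)


def extract_dynamic_instance_masks(class_map, instance_map,
                                   dynamic_class_ids=DYNAMIC_CLASS_IDS):
    """Return a list of boolean masks, one per dynamic-object instance.

    class_map:    (H, W) integer array of per-pixel semantic class ids.
    instance_map: (H, W) integer array of per-pixel instance ids
                  (ids are assumed to be unique only within a class).
    Pixels of static classes, including background, appear in no mask.
    """
    class_map = np.asarray(class_map)
    instance_map = np.asarray(instance_map)
    if class_map.shape != instance_map.shape:
        raise ValueError("class_map and instance_map must have the same shape")

    # Pixels that belong to any dynamic class.
    dynamic = np.isin(class_map, list(dynamic_class_ids))

    masks = []
    for cls in np.unique(class_map[dynamic]):
        cls_region = class_map == cls
        # Split each dynamic class into its individual instances.
        for inst in np.unique(instance_map[cls_region]):
            masks.append(cls_region & (instance_map == inst))
    return masks


if __name__ == "__main__":
    # Toy 3x4 scene: one instance of class 11 and two instances of class 13.
    class_map = np.array([[0, 0, 13, 13],
                          [0, 11, 0, 13],
                          [0, 11, 0, 13]])
    instance_map = np.array([[0, 0, 1, 2],
                             [0, 1, 0, 2],
                             [0, 1, 0, 2]])
    masks = extract_dynamic_instance_masks(class_map, instance_map)
    print(len(masks))  # 3 instance masks
```

Keeping the masks separate per instance, rather than merging them into a single dynamic-region mask, lets a downstream step treat each moving object independently. How PITTA actually uses the masks during adaptation is not specified in this abstract.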