Video monocular depth estimation is essential for applications such as autonomous driving, AR/VR, and robotics. Recent transformer-based monocular depth estimation models perform well on single images but struggle to maintain depth consistency across video frames. Traditional methods attempt to improve temporal consistency with multi-frame temporal modules or prior information such as optical flow and camera parameters, but these approaches suffer from high memory use, degraded performance under dynamic or irregular motion, and limited motion understanding. We propose STATIC, a novel model that independently learns temporal consistency in static and dynamic areas without additional information. A difference mask derived from surface normals identifies static and dynamic areas by measuring directional variance. For static areas, the Masked Static (MS) module enhances temporal consistency by focusing on stable regions. For dynamic areas, the Surface Normal Similarity (SNS) module aligns corresponding regions and enhances temporal consistency by measuring feature similarity between frames. A final refinement integrates the independently learned static and dynamic areas, enabling STATIC to achieve temporal consistency across the entire sequence. Our method achieves state-of-the-art video depth estimation on the KITTI and NYUv2 datasets without additional information.
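To make the static/dynamic separation concrete, the sketch below illustrates one way a difference mask could be computed from surface normals of consecutive frames by thresholding local directional variance. This is a minimal illustration under assumed conventions (the function name, window size, and threshold are hypothetical and not the paper's actual formulation).

```python
import torch
import torch.nn.functional as F

def difference_mask(normals_t, normals_t1, window=7, threshold=0.1):
    """Hypothetical sketch: flag dynamic pixels via the local directional
    variance of the change in surface normals between two frames.

    normals_t, normals_t1: (B, 3, H, W) unit surface-normal maps for
    consecutive frames. Returns a (B, 1, H, W) mask that is 1 where normal
    directions vary strongly over time (dynamic) and 0 in stable regions.
    """
    # Per-pixel change in normal direction between the two frames.
    diff = normals_t1 - normals_t                                # (B, 3, H, W)

    # Local directional variance in a sliding window: E[d^2] - E[d]^2,
    # summed over the three normal components.
    mean = F.avg_pool2d(diff, window, stride=1, padding=window // 2)
    mean_sq = F.avg_pool2d(diff ** 2, window, stride=1, padding=window // 2)
    var = (mean_sq - mean ** 2).sum(dim=1, keepdim=True)         # (B, 1, H, W)

    # Threshold the variance to separate dynamic from static areas.
    return (var > threshold).float()
```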