Depth estimation in videos is essential for visual perception in real-world applications. However, existing methods either rely on simple frame-by-frame monocular models, leading to temporal inconsistencies and inaccuracies, or use computationally demanding temporal modeling that is unsuitable for real-time applications. These limitations significantly restrict their general applicability and performance in practical settings. To address this, we propose VeloDepth, an efficient and robust online video depth estimation pipeline that effectively leverages spatiotemporal priors from previous depth predictions and performs deep feature propagation. Our method introduces a novel Propagation Module that refines and propagates depth features and predictions using flow-based warping coupled with learned residual corrections. In addition, our design structurally enforces temporal consistency, resulting in stable depth predictions across consecutive frames with improved efficiency. Comprehensive zero-shot evaluation on multiple benchmarks demonstrates the state-of-the-art temporal consistency and competitive accuracy of VeloDepth, alongside its significantly faster inference than existing video-based depth estimators. VeloDepth thus provides a practical, efficient, and accurate solution for real-time depth estimation suitable for diverse perception tasks. Code and models are available at https://github.com/lpiccinelli-eth/velodepth.
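To make the propagation idea concrete, the following is a minimal PyTorch-style sketch of flow-based warping combined with a learned residual correction, as described above. The module and function names (`PropagationSketch`, `warp_with_flow`) and the design of the residual head are illustrative assumptions, not the authors' actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def warp_with_flow(feat: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Backward-warp previous-frame features using optical flow.

    feat: (B, C, H, W) depth features/predictions from the previous frame.
    flow: (B, 2, H, W) flow from the current frame to the previous frame, in pixels.
    """
    b, _, h, w = feat.shape
    # Build a pixel-coordinate grid and displace it by the flow.
    ys, xs = torch.meshgrid(
        torch.arange(h, device=feat.device, dtype=feat.dtype),
        torch.arange(w, device=feat.device, dtype=feat.dtype),
        indexing="ij",
    )
    grid = torch.stack((xs, ys), dim=0).unsqueeze(0) + flow  # (B, 2, H, W)
    # Normalize coordinates to [-1, 1] as required by grid_sample.
    grid_x = 2.0 * grid[:, 0] / (w - 1) - 1.0
    grid_y = 2.0 * grid[:, 1] / (h - 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)  # (B, H, W, 2)
    return F.grid_sample(feat, grid, align_corners=True, padding_mode="border")


class PropagationSketch(nn.Module):
    """Warp previous-frame depth features, then add a learned residual correction."""

    def __init__(self, channels: int = 64):
        super().__init__()
        # Small head (illustrative) that compares warped and current features
        # and predicts a residual correction to the propagated features.
        self.residual_head = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, prev_feat, cur_feat, flow):
        warped = warp_with_flow(prev_feat, flow)  # spatiotemporal prior from the past frame
        residual = self.residual_head(torch.cat([warped, cur_feat], dim=1))
        return warped + residual  # refined, propagated depth features
```

Reusing the warped features as the base prediction and only learning a residual is one way such a design can structurally encourage temporal consistency: consecutive outputs stay anchored to the propagated prior unless the current frame provides evidence for a correction.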