Despite the rapid development of large 3D models, classical optimization-based approaches still dominate the field of visual odometry (VO). As a result, current VO approaches rely heavily on camera parameters and many handcrafted components, most of which involve complex bundle adjustment and feature-matching processes. Although largely disregarded in the literature, we find this reliance problematic in terms of both (1) speed, as performing bundle adjustment requires a significant amount of time, and (2) scalability, as handcrafted components struggle to benefit from large-scale training data. In this work, we introduce a simple yet efficient architecture, the Visual Odometry Transformer (VoT), which formulates monocular visual odometry as a direct relative pose regression problem. Our approach streamlines the monocular visual odometry pipeline in an end-to-end manner, effectively eliminating the need for handcrafted components such as bundle adjustment, feature matching, or camera calibration. We show that VoT runs up to 4 times faster than traditional approaches while achieving competitive or better performance. Compared to recent 3D foundation models, VoT runs 10 times faster and exhibits strong scaling behavior with respect to both model size and training data. Moreover, VoT generalizes well in both low-data regimes and previously unseen scenarios, narrowing the gap between optimization-based and end-to-end approaches.
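To make the formulation concrete, the following is a minimal sketch of direct relative pose regression with a Transformer backbone. The ViT encoder from timm, the 6-DoF output parameterization (3-D translation plus 3-D axis-angle rotation), and the fusion head are illustrative assumptions, not VoT's published architecture.

```python
# Minimal sketch of direct relative pose regression (assumptions: a ViT
# backbone via timm and a 6-DoF output head; not VoT's exact design).
import torch
import torch.nn as nn
import timm  # assumed backbone library; any ViT-style encoder would do

class RelativePoseRegressor(nn.Module):
    """Regresses the relative camera pose between two consecutive frames."""

    def __init__(self, embed_dim: int = 768):
        super().__init__()
        # Shared per-frame encoder; num_classes=0 yields pooled features.
        self.encoder = timm.create_model(
            "vit_base_patch16_224", pretrained=False, num_classes=0
        )
        # Fuse the two frame embeddings and regress a 6-DoF relative pose:
        # 3 values for translation, 3 for an axis-angle rotation.
        self.head = nn.Sequential(
            nn.Linear(2 * embed_dim, embed_dim),
            nn.GELU(),
            nn.Linear(embed_dim, 6),
        )

    def forward(self, frame_t: torch.Tensor, frame_t1: torch.Tensor):
        # frame_t, frame_t1: (B, 3, 224, 224) consecutive video frames.
        feat_t = self.encoder(frame_t)    # (B, embed_dim)
        feat_t1 = self.encoder(frame_t1)  # (B, embed_dim)
        pose = self.head(torch.cat([feat_t, feat_t1], dim=-1))
        return pose[:, :3], pose[:, 3:]   # translation, axis-angle rotation
```

Under this formulation, training reduces to supervised regression against ground-truth relative poses, and a full trajectory is recovered at inference time by composing the per-frame relative poses, with no bundle adjustment in the loop.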