OPFormer: Object Pose Estimation leveraging foundation model with geometric encoding

We introduce a unified, end-to-end framework that integrates object detection and pose estimation with a versatile onboarding process. The pipeline begins with an onboarding stage that generates object representations either from traditional 3D CAD models or, in their absence, from a high-fidelity neural representation (NeRF) rapidly reconstructed from multi-view images. Given a test image, the system first employs the CNOS detector to localize target objects. For each detection, our pose estimation module, OPFormer, infers the 6D pose. The core of OPFormer is a transformer-based architecture that leverages a foundation model for robust feature extraction. It learns a comprehensive object representation by jointly encoding multiple template views, and it enriches these features with explicit 3D geometric priors using Normalized Object Coordinate Space (NOCS). A decoder then establishes 2D-3D correspondences, from which the final pose is determined. Evaluated on the challenging BOP benchmarks, the integrated system demonstrates a strong balance between accuracy and efficiency, showing its practical applicability in both model-based and model-free scenarios.
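To make the NOCS-based geometric encoding more concrete, the following is a minimal sketch of how object-frame 3D points could be mapped into a normalized coordinate space and back. It is an illustration only, not the authors' implementation: the function names (`nocs_params`, `to_nocs`, `from_nocs`) are hypothetical, and the normalization convention (center of the tight axis-aligned bounding box mapped to 0.5, uniform scaling by the box diagonal) is an assumption that the abstract does not specify.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def nocs_params(points: Sequence[Point]) -> Tuple[Point, float]:
    """Return the bounding-box center and diagonal length of a point set.

    Assumed convention (not stated in the abstract): a tight axis-aligned
    bounding box in the object frame defines the normalization.
    """
    mins = [min(p[i] for p in points) for i in range(3)]
    maxs = [max(p[i] for p in points) for i in range(3)]
    center = tuple((lo + hi) / 2.0 for lo, hi in zip(mins, maxs))
    diagonal = math.sqrt(sum((hi - lo) ** 2 for lo, hi in zip(mins, maxs)))
    if diagonal == 0.0:
        raise ValueError("degenerate model: all points coincide")
    return center, diagonal


def to_nocs(point: Point, center: Point, diagonal: float) -> Point:
    """Map an object-frame point into the unit cube [0, 1]^3."""
    return tuple((point[i] - center[i]) / diagonal + 0.5 for i in range(3))


def from_nocs(coord: Point, center: Point, diagonal: float) -> Point:
    """Invert to_nocs: recover the object-frame point from a NOCS value."""
    return tuple((coord[i] - 0.5) * diagonal + center[i] for i in range(3))


if __name__ == "__main__":
    # Hypothetical toy model: a few points in the object frame.
    model = [(0.0, 0.0, 0.0), (2.0, 2.0, 2.0), (1.0, 0.5, 2.0)]
    center, diagonal = nocs_params(model)
    for p in model:
        c = to_nocs(p, center, diagonal)
        print(p, "->", c, "->", from_nocs(c, center, diagonal))
```

Under these assumptions, every surface point visible in a template view carries a three-channel value in [0, 1], which can be attached to the corresponding image features as a geometric prior. The inverse mapping is what would turn a NOCS value associated with a test-image pixel into a 3D model point, giving the kind of 2D-3D pair that a PnP solver consumes.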
