Accurate 6D pose estimation and tracking are fundamental capabilities for physical AI systems such as robots. However, existing approaches typically require a pre-defined 3D model of the target and rely on a manually annotated segmentation mask in the first frame, which is labor-intensive and degrades performance under occlusion or rapid motion. To address these limitations, we propose STORM (Segment, Track, and Object Re-localization from a single iMage), an open-source, robust, real-time 6D pose estimation system that requires no manual annotation. STORM employs a novel three-stage pipeline combining vision-language understanding with feature matching: contextual object descriptions guide localization, self-cross-attention mechanisms identify candidate regions, and the final stage produces precise masks and 3D models for accurate pose estimation. Another key innovation is our automatic re-registration mechanism, which detects tracking failures through feature similarity monitoring and recovers from severe occlusions or rapid motion. STORM achieves state-of-the-art accuracy on challenging industrial datasets featuring multi-object occlusions, high-speed motion, and varying illumination, while operating at real-time speeds without additional training. This annotation-free approach significantly reduces deployment overhead, providing a practical solution for modern applications such as flexible manufacturing and intelligent quality control.
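As a rough illustration of the re-registration trigger described above, the following is a minimal sketch that monitors cosine similarity between the object features stored at registration and those of the current frame, flagging a tracking failure when similarity drops below a threshold. The function names, feature source, and threshold value are assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

# Assumed threshold for declaring a tracking failure; not from the paper.
SIMILARITY_THRESHOLD = 0.5


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two flattened feature vectors."""
    a, b = a.ravel(), b.ravel()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))


def needs_reregistration(ref_feat: np.ndarray, cur_feat: np.ndarray,
                         threshold: float = SIMILARITY_THRESHOLD) -> bool:
    """Flag a tracking failure (e.g., severe occlusion or rapid motion)
    when feature similarity falls below the threshold, so the system can
    re-localize the object and re-register it for continued tracking."""
    return cosine_similarity(ref_feat, cur_feat) < threshold


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ref = rng.normal(size=256)                        # features stored at registration
    drifted = ref + rng.normal(scale=2.0, size=256)   # heavily corrupted frame
    print(needs_reregistration(ref, drifted))         # likely True -> re-register
```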