Standing Between Past and Future: Spatio-Temporal Modeling for Multi-Camera 3D Multi-Object Tracking

This work proposes an end-to-end multi-camera 3D multi-object tracking (MOT) framework that emphasizes spatio-temporal continuity and integrates both past and future reasoning for tracked objects; we therefore name it "Past-and-Future reasoning for Tracking" (PF-Track). Specifically, our method adapts the "tracking by attention" framework and represents tracked instances coherently over time with object queries. To explicitly use historical cues, our "Past Reasoning" module learns to refine the tracks and enhance the object features by cross-attending to queries from previous frames and other objects. The "Future Reasoning" module digests historical information and predicts robust future trajectories. Under long-term occlusion, our method maintains object positions and enables re-association by integrating motion predictions. On the nuScenes dataset, our method improves AMOTA by a large margin and reduces ID switches by 90% compared to prior approaches, an order-of-magnitude reduction. The code and models are available at https://github.com/TRI-ML/PF-Track.
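The occlusion handling described above can be illustrated with a small sketch. It is not the PF-Track implementation: the `Track` class, the `step` function, the 2D centers, the per-frame offsets, and the `max_missed` threshold are all hypothetical names and simplifications chosen for illustration. The idea it shows is only that a track keeps its identity while occluded by advancing along previously predicted motion, so it can be re-associated when the object reappears.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Point = Tuple[float, float]  # simplified 2D object center (assumption)


@dataclass
class Track:
    """Hypothetical track state; not the data structure used by PF-Track."""
    track_id: int
    position: Point
    # Predicted per-frame displacements, standing in for the output of a
    # future-reasoning step (assumption).
    future_offsets: List[Point] = field(default_factory=list)
    missed_frames: int = 0


def step(track: Track,
         box_center: Optional[Point],
         new_offsets: Optional[List[Point]] = None,
         max_missed: int = 5) -> bool:
    """Advance a track by one frame; return False if it should be dropped.

    box_center is the confident box center for this track in the current
    frame, or None if the object is treated as occluded.
    """
    if box_center is not None:
        # Visible: take the observed position and refresh the motion forecast.
        track.position = box_center
        track.future_offsets = list(new_offsets or [])
        track.missed_frames = 0
        return True

    # Occluded: keep the same track_id and move along the stored forecast.
    track.missed_frames += 1
    if track.missed_frames > max_missed:
        return False
    if track.future_offsets:
        dx, dy = track.future_offsets.pop(0)
        track.position = (track.position[0] + dx, track.position[1] + dy)
    return True


if __name__ == "__main__":
    t = Track(track_id=7, position=(0.0, 0.0))
    step(t, (1.0, 0.0), [(1.0, 0.0), (1.0, 0.0)])  # visible
    step(t, None)                                   # occluded
    step(t, None)                                   # occluded
    step(t, (4.1, 0.0), [(1.0, 0.0)])               # reappears, same ID
    print(t.track_id, t.position, t.missed_frames)
```

In this sketch the identity is preserved simply because the same `Track` object persists across frames, which loosely mirrors how an object query persists over time in a query-based tracker.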
