Tracking Any Point (TAP) in a video is a challenging computer vision problem with many demonstrated applications in robotics, video editing, and 3D reconstruction. Existing methods for TAP rely heavily on complex tracking-specific inductive biases and heuristics, limiting their generality and potential for scaling. To address these challenges, we present TAPNext, a new approach that casts TAP as sequential masked token decoding. Our model is causal, tracks in a purely online fashion, and removes tracking-specific inductive biases. This enables TAPNext to run with minimal latency and removes the temporal windowing required by many existing state-of-the-art trackers. Despite its simplicity, TAPNext achieves state-of-the-art tracking performance among both online and offline trackers. Finally, we present evidence that many widely used tracking heuristics emerge naturally in TAPNext through end-to-end training. The TAPNext model and code are available at https://tap-next.github.io/.
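To make the abstract's framing concrete, the sketch below illustrates, under loose assumptions, what casting point tracking as sequential masked token decoding can look like: the track is a sequence of position tokens, all but the query masked, and each token is decoded causally, one frame at a time, using only frames seen so far. All names and the toy encoder/decoder (`encode`, `decode`, `track_online`) are hypothetical stand-ins and not the TAPNext implementation.

```python
# A minimal conceptual sketch (not the authors' code) of framing point tracking
# as sequential masked token decoding with a causal, purely online model.
import numpy as np

MASK = None  # placeholder for an unknown (masked) track-position token


def encode(frame):
    # Hypothetical per-frame encoder: mean color as a stand-in feature vector.
    return frame.mean(axis=(0, 1))


def decode(state, prev_xy):
    # Hypothetical decoder: emits the next position token from the running
    # state; a real model would decode a learned token, here we keep it trivial.
    return prev_xy


def track_online(video, query_xy):
    # The track is a token sequence: the query is given, all later frames masked.
    tokens = [query_xy] + [MASK] * (len(video) - 1)
    state = np.zeros(3)
    for t, frame in enumerate(video):
        # Causal update: the state depends only on frames 0..t, so each token
        # can be emitted online with minimal latency and no temporal window.
        state = 0.9 * state + 0.1 * encode(frame)
        if tokens[t] is MASK:
            tokens[t] = decode(state, tokens[t - 1])  # unmask the token at frame t
    return tokens


# Usage: track one query point through a synthetic 8-frame video, frame by frame.
video = np.random.rand(8, 32, 32, 3)
print(track_online(video, np.array([16.0, 16.0])))
```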