Recently, the Transformer has been widely explored in tracking and has shown state-of-the-art (SOTA) performance. However, existing efforts mainly focus on fusing and enhancing features generated by convolutional neural networks (CNNs); the potential of the Transformer for representation learning remains under-explored. In this paper, we aim to further unleash the power of the Transformer by proposing a simple yet efficient fully-attentional tracker, dubbed SwinTrack, within the classic Siamese framework. In particular, both representation learning and feature fusion in SwinTrack leverage the Transformer architecture, enabling better feature interactions for tracking than pure CNN or hybrid CNN-Transformer frameworks. Besides, to further enhance robustness, we present a novel motion token that embeds the historical target trajectory to improve tracking by providing temporal context. Our motion token is lightweight with negligible computation cost but brings clear gains. In our thorough experiments, SwinTrack exceeds existing approaches on multiple benchmarks. In particular, on the challenging LaSOT benchmark, SwinTrack sets a new record with a 0.713 SUC score. It also achieves SOTA results on other benchmarks. We expect SwinTrack to serve as a solid baseline for Transformer tracking and to facilitate future research. Our code and results are released at https://github.com/LitingLin/SwinTrack.
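To make the described design concrete, below is a minimal, hypothetical PyTorch sketch of a fully-attentional Siamese pipeline in the spirit of the abstract: a shared Transformer encoder embeds template and search-region tokens, a cross-attention block fuses them, and a single "motion token", built here from a toy encoding of past target boxes, is appended to the fusion context to inject temporal information. All module names, dimensions, and the trajectory encoding are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a fully-attentional Siamese tracker with a motion token.
# Assumptions: patch tokens are precomputed, dim=256, and the motion token is a
# simple linear projection of averaged past boxes (illustrative only).

import torch
import torch.nn as nn


class FullyAttentionalTracker(nn.Module):
    def __init__(self, dim=256, heads=8, depth=4):
        super().__init__()
        # Shared (Siamese) Transformer backbone over patch embeddings.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=depth)
        # Cross-attention fusion: search tokens attend to template + motion tokens.
        self.fusion = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Project a short history of past boxes (cx, cy, w, h) into one motion token.
        self.motion_proj = nn.Linear(4, dim)
        # Per-token head predicting box parameters plus a foreground score.
        self.head = nn.Linear(dim, 5)

    def forward(self, template_tokens, search_tokens, past_boxes):
        # template_tokens: (B, Nt, dim)   search_tokens: (B, Ns, dim)
        # past_boxes: (B, T, 4) normalized historical target boxes.
        z = self.backbone(template_tokens)             # template features
        x = self.backbone(search_tokens)               # search-region features
        motion = self.motion_proj(past_boxes.mean(1))  # (B, dim) trajectory summary
        motion = motion.unsqueeze(1)                   # (B, 1, dim) motion token
        context = torch.cat([z, motion], dim=1)        # template + motion context
        fused, _ = self.fusion(query=x, key=context, value=context)
        return self.head(fused)                        # (B, Ns, 5) per-token outputs


# Toy usage: batch of 2, 64 template tokens, 256 search tokens, 8 past boxes.
tracker = FullyAttentionalTracker()
out = tracker(torch.randn(2, 64, 256), torch.randn(2, 256, 256), torch.rand(2, 8, 4))
print(out.shape)  # torch.Size([2, 256, 5])
```

The sketch only illustrates why the motion token is cheap: it adds a single extra token to the fusion context, so the added cost is one more key/value entry in cross-attention rather than a separate temporal network.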