In this paper, we introduce a novel visual representation learning approach that relies on a handful of adaptively learned tokens and is applicable to both image and video understanding tasks. Instead of relying on hand-designed splitting strategies to obtain visual tokens and processing a large number of densely sampled patches for attention, our approach learns to mine important tokens in visual data. This results in efficiently and effectively finding a few important visual tokens and enables modeling of pairwise attention between such tokens over a longer temporal horizon for videos, or over the spatial content in images. Our experiments demonstrate strong performance on several challenging benchmarks for both image and video recognition tasks. Importantly, because our tokens are adaptive, we accomplish competitive results at a significantly reduced amount of compute. We obtain results comparable to the state of the art on ImageNet while being computationally more efficient. We establish new state-of-the-art results on multiple video datasets, including Kinetics-400, Kinetics-600, Charades, and AViD. The code is available at: https://github.com/google-research/scenic/tree/main/scenic/projects/token_learner
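To make the core idea concrete, the following is a minimal sketch (not the authors' exact implementation) of how a small set of adaptive tokens can be mined from a feature map: learn one spatial attention map per token and use it to pool an H x W x C feature map into S tokens. The module name, layer choices, and sizes below are illustrative assumptions.

```python
import jax
import jax.numpy as jnp
import flax.linen as nn


class TokenLearnerSketch(nn.Module):
    num_tokens: int = 8  # S: number of adaptively learned tokens (assumed value)

    @nn.compact
    def __call__(self, x):
        # x: [batch, height, width, channels] feature map from a vision backbone.
        b, h, w, c = x.shape
        # Predict one spatial attention map per token with a 1x1 convolution.
        attn = nn.Conv(features=self.num_tokens, kernel_size=(1, 1))(x)
        attn = jax.nn.sigmoid(attn)                      # [b, h, w, S]
        attn = attn.reshape(b, h * w, self.num_tokens)   # flatten spatial dims
        feat = x.reshape(b, h * w, c)
        # Each token is an attention-weighted spatial average of the features.
        tokens = jnp.einsum('bps,bpc->bsc', attn, feat)
        tokens = tokens / (attn.sum(axis=1)[..., None] + 1e-6)
        return tokens  # [b, S, c]


# Usage: reduce a 14x14 grid of 768-d features to 8 adaptive tokens.
x = jnp.ones((2, 14, 14, 768))
module = TokenLearnerSketch(num_tokens=8)
params = module.init(jax.random.PRNGKey(0), x)
out = module.apply(params, x)  # shape (2, 8, 768)
```

Because downstream attention then operates over only S tokens instead of all H x W patches, the cost of subsequent transformer layers drops accordingly; for videos, the same pooling can be applied per frame so attention spans a longer temporal horizon at the same budget.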