Perceiving the world in terms of objects and tracking them through time is a crucial prerequisite for reasoning and scene understanding. Recently, several methods have been proposed for unsupervised learning of object-centric representations. However, since these models were evaluated on different downstream tasks, it remains unclear how they compare in terms of basic perceptual abilities such as detection, figure-ground segmentation and tracking of objects. To close this gap, we design a benchmark with four data sets of varying complexity and seven additional test sets featuring challenging tracking scenarios relevant to natural videos. Using this benchmark, we compare the perceptual abilities of four object-centric approaches: ViMON, a video extension of MONet based on recurrent spatial attention; OP3, which exploits clustering via spatial mixture models; and TBA and SCALOR, which use explicit factorization via spatial transformers. Our results suggest that the architectures with unconstrained latent representations learn more powerful representations in terms of object detection, segmentation and tracking than the spatial-transformer-based architectures. We also observe that none of the methods gracefully handles the most challenging tracking scenarios, despite their synthetic nature, suggesting that our benchmark may provide fruitful guidance towards learning more robust object-centric video representations.
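The abstract leaves the evaluation protocol implicit; the minimal sketch below illustrates one common way such tracking performance is scored: IoU-based matching of predicted and ground-truth segmentation masks combined with the CLEAR-MOT MOTA measure. The function names (`mask_iou`, `clear_mot_mota`), the greedy matching (standard implementations use Hungarian assignment), and the 0.5 IoU threshold are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection-over-union between two binary object masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    return float(np.logical_and(pred, gt).sum() / union) if union else 0.0

def clear_mot_mota(frames, iou_thresh=0.5):
    """CLEAR-MOT accuracy over one video (illustrative sketch).

    `frames` is a list of (pred_masks, gt_masks) pairs, where each element is a
    dict mapping a track id to a binary mask. A prediction counts as a match if
    its IoU with a ground-truth mask reaches `iou_thresh`; an identity switch is
    counted whenever a ground-truth track changes its matched predicted id.
    """
    fn = fp = ids = total_gt = 0
    last_match = {}  # ground-truth track id -> matched predicted id so far
    for pred_masks, gt_masks in frames:
        total_gt += len(gt_masks)
        matched_preds = set()
        for g_id, g_mask in gt_masks.items():
            # Greedily match each ground-truth object to its best unmatched prediction.
            best = max(
                ((p_id, mask_iou(p_mask, g_mask))
                 for p_id, p_mask in pred_masks.items() if p_id not in matched_preds),
                key=lambda x: x[1], default=(None, 0.0),
            )
            if best[1] >= iou_thresh:
                matched_preds.add(best[0])
                if g_id in last_match and last_match[g_id] != best[0]:
                    ids += 1  # same object, different predicted identity
                last_match[g_id] = best[0]
            else:
                fn += 1  # missed object
        fp += len(pred_masks) - len(matched_preds)  # unmatched predictions
    return 1.0 - (fn + fp + ids) / max(total_gt, 1)
```

Under this convention, MOTA = 1 means every object is detected in every frame with a consistent identity, while misses, false positives and identity switches each subtract from the score.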