The visual world can be parsimoniously characterized in terms of distinct entities with sparse interactions. Discovering this compositional structure in dynamic visual scenes has proven challenging for end-to-end computer vision approaches unless explicit instance-level supervision is provided. Slot-based models leveraging motion cues have recently shown great promise in learning to represent, segment, and track objects without direct supervision, but they still fail to scale to complex real-world multi-object videos. In an effort to bridge this gap, we take inspiration from human development and hypothesize that information about scene geometry in the form of depth signals can facilitate object-centric learning. We introduce SAVi++, an object-centric video model which is trained to predict depth signals from a slot-based video representation. By further leveraging best practices for model scaling, we are able to train SAVi++ to segment complex dynamic scenes recorded with moving cameras, containing both static and moving objects of diverse appearance on naturalistic backgrounds, without the need for segmentation supervision. Finally, we demonstrate that by using sparse depth signals obtained from LiDAR, SAVi++ is able to learn emergent object segmentation and tracking from videos in the real-world Waymo Open dataset.
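To make the core idea concrete, the following is a minimal, illustrative JAX sketch of an object-centric depth objective in the spirit described above: a slot-attention-style update over frame features, per-slot decoders producing depth maps and alpha masks, and an L2 loss against a depth target restricted to valid (e.g. sparse LiDAR) pixels. This is not the authors' SAVi++ architecture; every shape, layer, and name here is an assumption for illustration only.

```python
# Illustrative sketch only: all shapes, decoders, and names are assumptions,
# not the SAVi++ implementation.
import jax
import jax.numpy as jnp

K, D, N = 4, 32, 16 * 16  # slots, slot/feature dim, tokens per frame (assumed)

def slot_attention_step(slots, feats):
    """One simplified slot-attention update: slots (K, D), feats (N, D)."""
    attn = jax.nn.softmax(feats @ slots.T / jnp.sqrt(D), axis=1)  # compete over slots
    attn = attn / (attn.sum(axis=0, keepdims=True) + 1e-8)        # weighted mean per slot
    return attn.T @ feats                                          # updated slots (K, D)

def decode_depth(slots, w_depth, w_alpha):
    """Per-slot linear decoders to depth maps (K, N) and alpha masks (K, N)."""
    depth = slots @ w_depth                           # per-slot depth predictions
    alpha = jax.nn.softmax(slots @ w_alpha, axis=0)   # soft assignment of pixels to slots
    return (alpha * depth).sum(axis=0), alpha         # combined depth (N,), masks

def depth_loss(params, slots, feats, target, valid):
    """L2 depth loss masked by `valid`, mimicking sparse LiDAR supervision."""
    slots = slot_attention_step(slots, feats)
    pred, _ = decode_depth(slots, params['w_depth'], params['w_alpha'])
    return jnp.sum(valid * (pred - target) ** 2) / jnp.sum(valid)

key = jax.random.PRNGKey(0)
k1, k2, k3, k4 = jax.random.split(key, 4)
params = {'w_depth': jax.random.normal(k1, (D, N)) * 0.1,
          'w_alpha': jax.random.normal(k2, (D, N)) * 0.1}
slots = jax.random.normal(k3, (K, D))     # slot initialization (assumed random)
feats = jax.random.normal(k4, (N, D))     # stand-in for encoded frame features
target = jnp.ones((N,))                   # mock depth target
valid = (jnp.arange(N) % 7 == 0).astype(jnp.float32)  # mock sparse LiDAR mask

print(depth_loss(params, slots, feats, target, valid))
grads = jax.grad(depth_loss)(params, slots, feats, target, valid)
```

The design point the sketch tries to convey is that the loss touches only the decoded depth, never segmentation labels: the alpha masks that emerge from the slot competition are the unsupervised segmentation, which is what allows sparse depth signals to stand in for instance-level supervision.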