By estimating 3D shape and instances from a single view, we can capture information about an environment quickly, without the need for comprehensive scanning and multi-view fusion. Solving this task for composite scenes (such as object stacks) is challenging: occluded areas are ambiguous not only in shape but also in instance segmentation, and multiple decompositions could be valid. We observe that physics constrains decomposition as well as shape in occluded regions, and we hypothesise that a latent space learned from scenes built under physics simulation can serve as a prior to better predict shape and instances in occluded regions. To this end we propose SIMstack, a depth-conditioned Variational Auto-Encoder (VAE) trained on a dataset of objects stacked under physics simulation. We formulate instance segmentation as a centre-voting task, which allows class-agnostic detection and does not require setting a maximum number of objects in the scene. At test time, our model can generate 3D shape and instance segmentation from a single depth view, probabilistically sampling proposals for the occluded region from the learned latent space. Our method has practical applications in providing robots with some of the ability humans have to make rapid, intuitive inferences about partially observed scenes. We demonstrate this with an application to precise (non-disruptive) grasping of unknown objects from a single depth view.
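To make the centre-voting formulation concrete, the following is a minimal sketch, not the implementation from the paper: each occupied voxel is assumed to regress an offset toward its instance centre, and instances are recovered by clustering the resulting votes, so no maximum object count needs to be fixed in advance. All function and parameter names here are illustrative.

```python
# Minimal sketch of centre voting for class-agnostic instance segmentation.
# Assumes a network has already produced, for each occupied voxel, a 3D offset
# vote toward its instance centre; instances are then found by clustering the
# votes, so the number of objects is not fixed ahead of time.
import numpy as np
from sklearn.cluster import MeanShift

def instances_from_votes(voxel_coords, centre_offsets, bandwidth=0.05):
    """Cluster per-voxel centre votes into instance labels.

    voxel_coords:   (N, 3) coordinates of occupied voxels.
    centre_offsets: (N, 3) predicted offsets from each voxel to its centre.
    Returns an (N,) array of instance ids.
    """
    votes = voxel_coords + centre_offsets        # each voxel votes for a centre
    return MeanShift(bandwidth=bandwidth).fit_predict(votes)

# Toy usage: two well-separated "objects" yield two clusters of votes.
rng = np.random.default_rng(0)
coords = np.vstack([rng.normal(0.0, 0.02, (50, 3)),
                    rng.normal(1.0, 0.02, (50, 3))])
offsets = np.vstack([np.zeros(3) - coords[:50],   # votes for centre (0,0,0)
                     np.ones(3) - coords[50:]])   # votes for centre (1,1,1)
print(np.unique(instances_from_votes(coords, offsets)))  # -> [0 1]
```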
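Likewise, the sketch below illustrates test-time sampling from a depth-conditioned VAE under assumed, simplified module shapes: a small convolutional encoder produces the parameters of a latent distribution conditioned on the depth view, and repeated draws of the latent code are decoded into diverse proposals. The toy 16^3 occupancy decoder stands in for the paper's shape and instance output heads; none of these layer choices are from the paper.

```python
# Minimal PyTorch sketch of sampling multiple scene proposals from a
# depth-conditioned VAE at test time. Architecture details are placeholders.
import torch
import torch.nn as nn

class DepthConditionedVAE(nn.Module):
    def __init__(self, latent_dim=64):
        super().__init__()
        # Encoder maps a 64x64 depth image to the parameters of q(z | depth).
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 4, stride=2, padding=1), nn.ReLU(),   # -> 32x32
            nn.Conv2d(16, 32, 4, stride=2, padding=1), nn.ReLU(),  # -> 16x16
            nn.Flatten(),
            nn.Linear(32 * 16 * 16, 2 * latent_dim),  # mean and log-variance
        )
        # Decoder maps z to a coarse 16^3 occupancy grid (a stand-in for the
        # paper's shape and instance-vote outputs).
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 16 ** 3),
                                     nn.Sigmoid())

    def sample_completions(self, depth, n_samples=5):
        """Draw several plausible occupancy proposals for one depth view."""
        mu, logvar = self.encoder(depth).chunk(2, dim=-1)
        std = (0.5 * logvar).exp()
        grids = []
        for _ in range(n_samples):
            z = mu + std * torch.randn_like(std)   # reparameterised sample
            grids.append(self.decoder(z).view(-1, 16, 16, 16))
        return torch.stack(grids)                  # (n_samples, B, 16, 16, 16)

# Usage: one depth view yields several diverse occupancy proposals, mirroring
# the probabilistic sampling of occluded regions described above.
vae = DepthConditionedVAE()
depth = torch.rand(1, 1, 64, 64)
proposals = vae.sample_completions(depth, n_samples=3)
print(proposals.shape)  # torch.Size([3, 1, 16, 16, 16])
```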