As autonomous robots interact with and navigate around real-world environments such as homes, it is useful to reliably identify and manipulate articulated objects, such as doors and cabinets. Many prior works on identifying object articulation require manipulation of the object, either by the robot or a human. While recent works have addressed predicting articulation types from visual observations alone, they often assume prior knowledge of category-level kinematic motion models or a sequence of observations in which the articulated parts move according to their kinematic constraints. In this work, we propose FormNet, a neural network that identifies the articulation mechanisms between pairs of object parts from a single frame of an RGB-D image and segmentation masks. The network is trained on 100k synthetic images of 149 articulated objects from 6 categories. Synthetic images are rendered via a photorealistic simulator with domain randomization. Our proposed model predicts motion residual flows of object parts, and these flows are used to determine the articulation type and parameters. The network achieves an articulation type classification accuracy of 82.5% on novel object instances in trained categories. Experiments also show how this method generalizes to novel categories and can be applied to real-world images without fine-tuning.
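To make the flow-to-articulation step concrete: one way motion flows can determine an articulation type is to fit a rigid transform from the part points to their flow-displaced positions and inspect its rotational component. The sketch below is illustrative only, assuming per-point 3D flows; the function name, threshold, and Kabsch-based fitting are our own choices, not the paper's method.

```python
import numpy as np

def classify_articulation(points, flows, rot_thresh_deg=2.0):
    """Toy sketch: fit a rigid transform (Kabsch algorithm) mapping
    `points` -> `points + flows`, then call the motion revolute if the
    fitted rotation angle exceeds a threshold, else prismatic.
    points, flows: (N, 3) arrays of part points and their motion flows.
    """
    src = points
    dst = points + flows
    # Center both point sets, then solve for the optimal rotation via SVD.
    src_c = src - src.mean(axis=0)
    dst_c = dst - dst.mean(axis=0)
    H = src_c.T @ dst_c
    U, _, Vt = np.linalg.svd(H)
    # Correct for a possible reflection so R is a proper rotation.
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    # Rotation angle from the trace of R.
    angle = np.degrees(np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)))
    t = dst.mean(axis=0) - R @ src.mean(axis=0)
    label = "revolute" if angle > rot_thresh_deg else "prismatic"
    return label, angle, t

# Usage: a pure-translation flow classifies as prismatic; a flow produced
# by rotating the part classifies as revolute.
rng = np.random.default_rng(0)
pts = rng.standard_normal((50, 3))
print(classify_articulation(pts, np.tile([0.1, 0.0, 0.0], (50, 1)))[0])
```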