Representing a scene and its constituent objects from raw sensory data is a core ability for enabling robots to interact with their environment. In this paper, we propose a novel approach to scene understanding, leveraging a hierarchical object-centric generative model that enables an agent to infer object category and pose in an allocentric reference frame using active inference, a neuro-inspired framework for action and perception. To evaluate the behavior of an active vision agent, we also propose a new benchmark in which, given a target viewpoint of a particular object, the agent must find the best matching viewpoint in a 3D workspace with randomly positioned objects. We demonstrate that our active inference agent is able to balance epistemic foraging and goal-driven behavior, and outperforms both supervised and reinforcement learning baselines by a large margin.