Variational Meta Reinforcement Learning for Social Robotics

With the increasing presence of robots in our everyday environments, improving their social skills is of utmost importance. Nonetheless, social robotics still faces many challenges. One bottleneck is that robotic behaviors often need to be adapted, as social norms depend strongly on the environment. For example, a robot should navigate more carefully around patients in a hospital than around workers in an office. In this work, we investigate meta-reinforcement learning (meta-RL) as a potential solution. Here, robot behaviors are learned via reinforcement learning, where a reward function needs to be chosen so that the robot learns an appropriate behavior for a given environment. We propose to use a variational meta-RL procedure that quickly adapts the robot's behavior to new reward functions. As a result, given a new environment, different reward functions can be quickly evaluated and an appropriate one selected. The procedure learns a vectorized representation for reward functions and a meta-policy that can be conditioned on such a representation. Given observations from a new reward function, the procedure identifies its representation and conditions the meta-policy on it. While investigating the procedure's capabilities, we realized that it suffers from posterior collapse, where only a subset of the dimensions in the representation encode useful information, resulting in reduced performance. Our second contribution, a radial basis function (RBF) layer, partially mitigates this negative effect. The RBF layer lifts the representation to a higher-dimensional space, which is more easily exploitable by the meta-policy. We demonstrate the benefit of the RBF layer and of meta-RL for social robotics on four robotic simulation tasks.
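
To illustrate the idea of lifting a low-dimensional task representation with an RBF layer, the following is a minimal sketch, not the paper's implementation. It assumes Gaussian basis functions with fixed, evenly spaced centers shared across dimensions; the function name `rbf_lift`, the `centers` grid, the width parameter `gamma`, and the example `state` vector are all hypothetical choices made for illustration.

```python
import math


def rbf_lift(z, centers, gamma=1.0):
    """Lift each latent dimension z[i] onto Gaussian RBF activations.

    Assumption (not from the paper): every dimension shares the same
    fixed centers and the same width parameter gamma.
    Returns a flat list of len(z) * len(centers) features in (0, 1].
    """
    features = []
    for z_i in z:
        for c in centers:
            # Activation peaks at 1.0 when z_i sits exactly on a center.
            features.append(math.exp(-gamma * (z_i - c) ** 2))
    return features


# Hypothetical 2-dimensional reward-function representation.
z = [0.0, 1.0]
centers = [-2.0, -1.0, 0.0, 1.0, 2.0]

lifted = rbf_lift(z, centers)

# One simple way to condition a policy on the representation:
# concatenate the (hypothetical) state vector with the lifted features.
state = [0.3, -0.1, 0.7]
policy_input = state + lifted
```

In this sketch each latent dimension is expanded into five localized features, so a 2-dimensional representation becomes a 10-dimensional one. Small differences along a single informative dimension then activate different basis functions, which is one plausible reason a policy network could exploit the lifted form more easily.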
