A common strategy for dealing with the expensive reinforcement learning (RL) of complex tasks is to decompose them into a collection of subtasks that are usually simpler to learn and reusable for new problems. However, when a robot learns the policies for these subtasks, common approaches treat each policy learning process separately. Therefore, all these individual (composable) policies must be learned before tackling the learning of the complex task through policy composition. Moreover, such composition of individual policies is usually performed sequentially, which is not suitable for tasks that require the subtasks to be performed concurrently. In this paper, we propose to combine a set of composable Gaussian policies corresponding to these subtasks using a set of activation vectors, resulting in a compound Gaussian policy that is a function of the means and covariance matrices of the composable policies. Moreover, we propose an algorithm for learning both compound and composable policies within the same learning process, by exploiting the off-policy data generated by the compound policy. The algorithm is built on a maximum entropy RL approach to favor exploration during learning. Our experimental results show that the experience collected with the compound policy permits not only solving the complex task, but also obtaining useful composable policies that successfully perform their respective subtasks. Supplementary videos and code are available at https://sites.google.com/view/hrl-concurrent-discovery .
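The exact composition rule is defined in the paper itself; as a rough, hypothetical illustration of what "a compound Gaussian policy that is a function of the means and covariance matrices of the composable policies" could look like, the following sketch combines the sub-policies as a precision-weighted product of Gaussians, with the activation vector acting as the weights. The function name and the weighting scheme are assumptions for illustration, not the authors' exact formulation.

```python
import numpy as np

def compose_gaussian_policies(means, covs, activations):
    """Combine K composable Gaussian policies N(mu_k, Sigma_k) into one
    compound Gaussian using per-policy activation weights.

    NOTE: this assumes a precision-weighted (product-of-Gaussians style)
    combination; the paper defines its own functional form, so treat this
    only as a sketch.

    means       : list of K mean vectors, each of shape (d,)
    covs        : list of K covariance matrices, each of shape (d, d)
    activations : array of K non-negative weights (the activation vector)
    """
    d = means[0].shape[0]
    precision = np.zeros((d, d))  # accumulated compound precision matrix
    weighted_mean = np.zeros(d)
    for mu_k, sigma_k, w_k in zip(means, covs, activations):
        lam_k = w_k * np.linalg.inv(sigma_k)  # activation-weighted precision
        precision += lam_k
        weighted_mean += lam_k @ mu_k
    cov = np.linalg.inv(precision)  # compound covariance
    mean = cov @ weighted_mean      # compound mean
    return mean, cov

# Example: two 2-D sub-policies combined with activation vector (0.7, 0.3).
mu1, S1 = np.array([0.0, 1.0]), np.eye(2)
mu2, S2 = np.array([1.0, 0.0]), 0.5 * np.eye(2)
mean, cov = compose_gaussian_policies([mu1, mu2], [S1, S2],
                                      np.array([0.7, 0.3]))
action = np.random.multivariate_normal(mean, cov)  # sample a compound action
```

Under this assumed scheme, a sub-policy with a larger activation weight (or a tighter covariance) pulls the compound action distribution more strongly toward its own mean, which is one plausible way the subtasks could be executed concurrently rather than sequentially.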