Planning to Practice: Efficient Online Fine-Tuning by Composing Goals in Latent Space

General-purpose robots require diverse repertoires of behaviors to complete challenging tasks in unstructured real-world environments. To address this need, goal-conditioned reinforcement learning aims to acquire policies that can reach configurable goals for a wide range of tasks on command. However, such goal-conditioned policies are notoriously difficult and time-consuming to train from scratch. In this paper, we propose Planning to Practice (PTP), a method that makes it practical to train goal-conditioned policies for long-horizon tasks that require multiple distinct types of interactions to solve. Our approach is based on two key ideas. First, we decompose the goal-reaching problem hierarchically: a high-level planner uses conditional subgoal generators in the latent space to set intermediate subgoals for a low-level model-free policy. Second, we propose a hybrid approach that first pre-trains both the conditional subgoal generator and the policy on previously collected data through offline reinforcement learning, and then fine-tunes the policy via online exploration. This fine-tuning process is itself facilitated by the planned subgoals, which break down the original target task into short-horizon goal-reaching tasks that are significantly easier to learn. We conduct experiments both in simulation and in the real world, in which the policy is pre-trained on demonstrations of short primitive behaviors and fine-tuned for temporally extended tasks that are unseen in the offline data. Our experimental results show that PTP can generate feasible sequences of subgoals that enable the policy to efficiently solve the target tasks.
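
To make the first idea more concrete, here is a minimal sketch of planning a chain of subgoals in a latent space. It is an illustration under stated assumptions, not the paper's implementation: `toy_subgoal_generator` is a hypothetical stand-in for a learned conditional subgoal generator, the planner is plain random shooting, and the cost (distance from the last subgoal to the goal) is a simplification chosen for this example.

```python
import numpy as np


def toy_subgoal_generator(z, u):
    """Hypothetical stand-in for a learned conditional subgoal generator.

    Maps the current latent state z and a noise vector u to a nearby latent
    subgoal. A real generator would be a trained conditional generative model.
    """
    return z + 0.5 * np.tanh(u)


def plan_subgoals(z_start, z_goal, generator, num_subgoals=4,
                  num_candidates=256, seed=0):
    """Random-shooting planner (an assumption of this sketch).

    Samples candidate subgoal chains by applying the generator recursively,
    then keeps the chain whose final subgoal is closest to the goal.
    """
    rng = np.random.default_rng(seed)
    latent_dim = z_start.shape[0]
    # Noise for every candidate chain and step: (candidates, steps, dim).
    noise = rng.standard_normal((num_candidates, num_subgoals, latent_dim))
    z = np.tile(z_start, (num_candidates, 1))
    chains = []
    for k in range(num_subgoals):
        # Each subgoal is generated conditioned on the previous one.
        z = generator(z, noise[:, k])
        chains.append(z)
    chains = np.stack(chains, axis=1)  # (candidates, steps, dim)
    cost = np.linalg.norm(chains[:, -1] - z_goal, axis=1)
    best = int(np.argmin(cost))
    return chains[best], float(cost[best])


# Toy 2-D latent space; the start and goal values are arbitrary.
z_start = np.zeros(2)
z_goal = np.array([1.5, -1.0])
subgoals, final_dist = plan_subgoals(z_start, z_goal, toy_subgoal_generator)
print(subgoals.shape, final_dist)
```

In a full system, each planned subgoal would be handed in turn to the low-level goal-conditioned policy, so that every practice episode is a short-horizon goal-reaching task.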
