Learning from Demonstrations (LfD) via Behavior Cloning (BC) works well on multiple complex tasks. However, a limitation of the typical LfD approach is that it requires expert demonstrations for all scenarios, including those in which the algorithm is already well trained. The recently proposed Learning from Interventions (LfI) overcomes this limitation by using an expert overseer, who intervenes only when an unsafe action appears to be imminent. Although LfI significantly improves over LfD, state-of-the-art LfI fails to account for the delay caused by the expert's reaction time and learns only short-term behavior. We address these limitations by (1) interpolating the expert's interventions back in time, and (2) splitting the policy into two hierarchical levels, one that generates sub-goals for the future and another that generates actions to reach those sub-goals. This sub-goal prediction forces the algorithm to learn long-term behavior while remaining robust to the expert's reaction time. Our experiments show that LfI using sub-goals in a hierarchical policy framework trains faster and achieves better asymptotic performance than typical LfD.
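To make the two-level structure concrete, below is a minimal sketch of such a hierarchical policy in PyTorch. It is an illustration under stated assumptions, not the paper's implementation: the network names (SubGoalNet, ActionNet), the hidden sizes, and the dimensions are all hypothetical. The high level maps the current state to a predicted sub-goal some fixed number of steps ahead, and the low level maps the (state, sub-goal) pair to an action.

```python
# Minimal sketch of a two-level hierarchical policy (illustrative only;
# network names, sizes, and dimensions are assumptions, not the paper's code).
import torch
import torch.nn as nn


class SubGoalNet(nn.Module):
    """High level: predicts a sub-goal (e.g. a future state) k steps ahead."""

    def __init__(self, state_dim, goal_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, goal_dim),
        )

    def forward(self, state):
        return self.net(state)


class ActionNet(nn.Module):
    """Low level: produces an action that moves the agent toward the sub-goal."""

    def __init__(self, state_dim, goal_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + goal_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, state, sub_goal):
        return self.net(torch.cat([state, sub_goal], dim=-1))


# At execution time the two levels are chained: the high-level sub-goal
# conditions the low-level action. (Dimensions below are placeholders.)
high = SubGoalNet(state_dim=8, goal_dim=8)
low = ActionNet(state_dim=8, goal_dim=8, action_dim=2)
state = torch.randn(1, 8)
action = low(state, high(state))
```

Because the expert's intervention arrives with a reaction delay, the correction can be interpolated back in time so that it also supervises the states shortly before the intervention, which is what ties the sub-goal horizon to robustness against that delay.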