Learning from demonstrations is an emerging paradigm for obtaining effective robot control policies for complex tasks via reinforcement learning, without the need to explicitly design reward functions. However, it is susceptible to imperfections in the demonstrations and raises concerns about the safety and interpretability of the learned control policies. To address these issues, we use Signal Temporal Logic (STL) to evaluate and rank the quality of demonstrations. Temporal logic-based specifications allow us to create non-Markovian rewards and to define interesting causal dependencies between tasks, such as sequential task specifications. We validate our approach through experiments on discrete-world and OpenAI Gym environments, and show that it outperforms the state-of-the-art Maximum Causal Entropy Inverse Reinforcement Learning.
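To make the ranking idea concrete, here is a minimal illustrative sketch (not taken from the paper) of how STL robustness can serve as a quantitative score for demonstrations. It assumes a hypothetical 1-D reach-goal task and the specification "eventually x >= goal", whose robustness over a trajectory is the maximum of the predicate's robustness over time; demonstrations with higher robustness are ranked as higher quality.

```python
# Illustrative sketch: rank demonstrations by the robustness of a simple
# STL specification phi = F (x >= goal) ("eventually reach the goal region").
# All names, trajectories, and the goal value below are hypothetical.

def robustness_eventually(trajectory, goal):
    """Robustness of 'eventually x >= goal' for a 1-D state trajectory:
    the maximum margin by which the predicate x - goal is satisfied."""
    return max(x - goal for x in trajectory)

def rank_demonstrations(demos, goal):
    """Sort demonstrations from best to worst by their STL robustness."""
    scored = [(robustness_eventually(d, goal), d) for d in demos]
    return sorted(scored, key=lambda s: s[0], reverse=True)

# Two hypothetical demonstrations, each a sequence of 1-D positions.
demos = [
    [0.0, 0.5, 1.2, 2.1],   # reaches the goal region (goal = 2.0)
    [0.0, 0.3, 0.9, 1.4],   # falls short of the goal
]
for rho, demo in rank_demonstrations(demos, goal=2.0):
    print(f"robustness = {rho:+.2f} for demo {demo}")
```

A positive robustness value indicates the specification is satisfied (with margin), while a negative value quantifies how badly the demonstration violates it; this graded signal is what allows imperfect demonstrations to be down-weighted rather than treated as equally valid.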