Efficient exploration remains a challenging problem in reinforcement learning, especially for tasks where extrinsic rewards from the environment are sparse or even entirely disregarded. Significant advances based on intrinsic motivation show promising results in simple environments but often get stuck in environments with multimodal and stochastic dynamics. In this work, we propose a variational dynamic model based on conditional variational inference to model this multimodality and stochasticity. We treat the environmental state-action transition as a conditional generative process, generating the next-state prediction conditioned on the current state, the action, and a latent variable. We derive an upper bound on the negative log-likelihood of the environmental transition and use this upper bound as the intrinsic reward for exploration, which allows the agent to learn skills through self-supervised exploration without observing extrinsic rewards. We evaluate the proposed method on several image-based simulation tasks and a real robotic manipulation task. Our method outperforms several state-of-the-art environment-model-based exploration approaches.
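For concreteness, the upper bound described above follows the standard conditional-VAE (negative evidence lower bound) decomposition; the sketch below uses our own notation ($s_t$, $a_t$, $z$ for state, action, and latent variable; $q_\phi$ an approximate posterior; $p_\theta$ the decoder and conditional prior), which may differ from the paper's exact parameterization:

$$
-\log p_\theta(s_{t+1}\mid s_t, a_t)
\;\le\;
\mathbb{E}_{q_\phi(z\mid s_t, a_t, s_{t+1})}\!\big[-\log p_\theta(s_{t+1}\mid s_t, a_t, z)\big]
\;+\;
D_{\mathrm{KL}}\!\big(q_\phi(z\mid s_t, a_t, s_{t+1})\,\big\|\,p_\theta(z\mid s_t, a_t)\big).
$$

Under this reading, the right-hand side serves as the per-transition intrinsic reward: transitions that the model reconstructs poorly, or whose latent posterior deviates strongly from the conditional prior, yield high reward and are therefore sought out by the exploring agent.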